2025-05-07T20:22:34.9461615Z Current runner version: '2.323.0'
2025-05-07T20:22:34.9469256Z Runner name: 'i-0efa96680de6b8d22'
2025-05-07T20:22:34.9470432Z Machine name: 'ip-10-0-51-101'
2025-05-07T20:22:34.9473264Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:34.9475608Z Contents: read
2025-05-07T20:22:34.9476126Z Metadata: read
2025-05-07T20:22:34.9476620Z Packages: read
2025-05-07T20:22:34.9477118Z ##[endgroup]
2025-05-07T20:22:34.9479391Z Secret source: None
2025-05-07T20:22:34.9480571Z Prepare workflow directory
2025-05-07T20:22:35.0005237Z Prepare all required actions
2025-05-07T20:22:35.0043075Z Getting action download info
2025-05-07T20:22:35.2061210Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.4743066Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:35.8140803Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.4082357Z Getting action download info
2025-05-07T20:22:37.5150690Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:37.7098075Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.9, 12.8.0, 12.6.3, clang)
2025-05-07T20:22:37.7665927Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:37.7788576Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:37.7800953Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:37.7802004Z ##[endgroup]
2025-05-07T20:22:38.9156802Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:38.9157257Z Instance Type: g5.4xlarge
2025-05-07T20:22:38.9157505Z AMI Name: unknown
2025-05-07T20:22:38.9197348Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.3275432Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.3275734Z with:
2025-05-07T20:22:44.3275968Z submodules: true
2025-05-07T20:22:44.3276199Z repository: pytorch/FBGEMM
2025-05-07T20:22:44.3276582Z token: ***
2025-05-07T20:22:44.3276781Z ssh-strict: true
2025-05-07T20:22:44.3276993Z ssh-user: git
2025-05-07T20:22:44.3277215Z persist-credentials: true
2025-05-07T20:22:44.3277460Z clean: true
2025-05-07T20:22:44.3277685Z sparse-checkout-cone-mode: true
2025-05-07T20:22:44.3277953Z fetch-depth: 1
2025-05-07T20:22:44.3278164Z fetch-tags: false
2025-05-07T20:22:44.3278377Z show-progress: true
2025-05-07T20:22:44.3278598Z lfs: false
2025-05-07T20:22:44.3278800Z set-safe-directory: true
2025-05-07T20:22:44.3279055Z env:
2025-05-07T20:22:44.3279262Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.3279567Z BUILD_ENV: build_binary
2025-05-07T20:22:44.3279831Z BUILD_TARGET: genai
2025-05-07T20:22:44.3280050Z BUILD_VARIANT: cuda
2025-05-07T20:22:44.3280307Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:44.3280556Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.3280795Z ##[endgroup]
2025-05-07T20:22:44.4474485Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.4475632Z ##[group]Getting Git version info
2025-05-07T20:22:44.4476080Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.4476694Z [command]/usr/bin/git version
2025-05-07T20:22:44.4476955Z git version 2.47.1
2025-05-07T20:22:44.4485353Z ##[endgroup]
2025-05-07T20:22:44.4499429Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/32e5c665-7cf4-4445-941f-b58e67342ba4' before making global git config changes
2025-05-07T20:22:44.4500337Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.4513869Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.4552860Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.4555737Z ##[group]Initializing the repository
2025-05-07T20:22:44.4560046Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.4602656Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.4603409Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.4603950Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.4604334Z hint:
2025-05-07T20:22:44.4604622Z hint: 	git config --global init.defaultBranch <name>
2025-05-07T20:22:44.4604957Z hint:
2025-05-07T20:22:44.4605275Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.4605822Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.4606237Z hint:
2025-05-07T20:22:44.4606452Z hint: 	git branch -m <name>
2025-05-07T20:22:44.4606949Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.4614749Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.4648741Z ##[endgroup]
2025-05-07T20:22:44.4649228Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.4652521Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.4683598Z ##[endgroup]
2025-05-07T20:22:44.4684014Z ##[group]Setting up auth
2025-05-07T20:22:44.4690013Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.4721363Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.5084009Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.5115728Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.5461121Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.5511348Z ##[endgroup]
2025-05-07T20:22:44.5511772Z ##[group]Fetching the repository
2025-05-07T20:22:44.5519975Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.3135755Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.3136447Z  * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3162246Z ##[endgroup]
2025-05-07T20:22:45.3162633Z ##[group]Determining the checkout info
2025-05-07T20:22:45.3165163Z ##[endgroup]
2025-05-07T20:22:45.3179797Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:45.3219570Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
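The extraheader dance above is how actions/checkout scopes the token to github.com without ever writing it into the remote URL. A minimal sketch of the same pattern; GH_TOKEN and the x-access-token framing are assumptions about how the action builds the header, not values visible in this log (the real value is masked as '***'):

    # GH_TOKEN is a hypothetical placeholder for the workflow token.
    AUTH_B64=$(printf 'x-access-token:%s' "${GH_TOKEN}" | base64 | tr -d '\n')
    # Scope the header to github.com only, in the local repo config.
    git config --local http.https://github.com/.extraheader "AUTHORIZATION: basic ${AUTH_B64}"
    # Cleanup later is a single unset, as in the submodule foreach above.
    git config --local --unset-all http.https://github.com/.extraheader

Because the credential lives in config rather than the URL, it never shows up in `git remote -v` output.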
2025-05-07T20:22:45.3250344Z ##[group]Checking out the ref
2025-05-07T20:22:45.3254697Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.4348571Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.4348854Z
2025-05-07T20:22:45.4349139Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.4349768Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.4350463Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.4350793Z
2025-05-07T20:22:45.4351014Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.4351505Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.4351785Z
2025-05-07T20:22:45.4351904Z   git switch -c <new-branch-name>
2025-05-07T20:22:45.4352109Z
2025-05-07T20:22:45.4352235Z Or undo this operation with:
2025-05-07T20:22:45.4352417Z
2025-05-07T20:22:45.4352523Z   git switch -
2025-05-07T20:22:45.4352959Z
2025-05-07T20:22:45.4353202Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.4353556Z
2025-05-07T20:22:45.4353971Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.4364038Z ##[endgroup]
2025-05-07T20:22:45.4364474Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.4370799Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.4422979Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.4455497Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.4487834Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.4515717Z ##[endgroup]
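The two insteadOf rewrites registered above route SSH-style submodule URLs through HTTPS, so the extraheader token covers every submodule fetch. A small sketch of the mechanism (the example clone URL is illustrative, not one this job clones via SSH):

    # After this rewrite, SSH-style GitHub URLs are fetched over HTTPS,
    # which is where the AUTHORIZATION header applies.
    git config --global url.https://github.com/.insteadOf git@github.com:
    # This now transparently clones https://github.com/pytorch/cpuinfo
    git clone git@github.com:pytorch/cpuinfo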
2025-05-07T20:22:45.4516100Z ##[group]Fetching submodules
2025-05-07T20:22:45.4518988Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.4861930Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.5194949Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.5198080Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.5201722Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.5205875Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.5210044Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.5214410Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.5218201Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.5249575Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:45.8713772Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.3279272Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:46.7141112Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:47.7129567Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.0503845Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.2944684Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.4688276Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.4688829Z  * branch e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:49.5168297Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:50.1628686Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:50.1629185Z  * branch 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:50.4328164Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:51.4403327Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:51.4403772Z  * branch 6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:51.5458187Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:52.5966120Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:52.5966593Z  * branch 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:53.2929084Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.0626567Z From https://github.com/google/googletest
2025-05-07T20:22:54.0627026Z  * branch f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.1042745Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:54.7291567Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:54.7292231Z  * branch 420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:54.7374366Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:55.4578239Z From https://github.com/nlohmann/json
2025-05-07T20:22:55.4578843Z  * branch 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:55.5716112Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:55.5735834Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:55.6066853Z Entering 'external/asmjit'
2025-05-07T20:22:55.6099493Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.6131478Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.6163373Z Entering 'external/cutlass'
2025-05-07T20:22:55.6195618Z Entering 'external/googletest'
2025-05-07T20:22:55.6227739Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.6259334Z Entering 'external/json'
2025-05-07T20:22:55.6303574Z ##[endgroup]
2025-05-07T20:22:55.6304065Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:55.6310237Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:55.6638120Z Entering 'external/asmjit'
2025-05-07T20:22:55.6705581Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.6775960Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.6842486Z Entering 'external/cutlass'
2025-05-07T20:22:55.6916272Z Entering 'external/googletest'
2025-05-07T20:22:55.6983262Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.7051780Z Entering 'external/json'
2025-05-07T20:22:55.7135852Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:55.7463198Z Entering 'external/asmjit'
2025-05-07T20:22:55.7525718Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:55.7527961Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.7590334Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:55.7593555Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.7654410Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:55.7658150Z Entering 'external/cutlass'
2025-05-07T20:22:55.7717335Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:55.7720283Z Entering 'external/googletest'
2025-05-07T20:22:55.7779861Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:55.7783506Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.7843156Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:55.7846268Z Entering 'external/json'
2025-05-07T20:22:55.7908962Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:55.8002545Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:55.8333932Z Entering 'external/asmjit'
2025-05-07T20:22:55.8365923Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.8397686Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.8429075Z Entering 'external/cutlass'
2025-05-07T20:22:55.8460557Z Entering 'external/googletest'
2025-05-07T20:22:55.8492752Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.8524069Z Entering 'external/json'
2025-05-07T20:22:55.8571369Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:55.8903002Z Entering 'external/asmjit'
2025-05-07T20:22:55.8933574Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.8964889Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.8996766Z Entering 'external/cutlass'
2025-05-07T20:22:55.9028068Z Entering 'external/googletest'
2025-05-07T20:22:55.9060046Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.9091239Z Entering 'external/json'
2025-05-07T20:22:55.9152694Z ##[endgroup]
2025-05-07T20:22:55.9177170Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:55.9203927Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
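Each "checked out '...'" line above pins a submodule to the exact SHA recorded in the superproject. A quick way to confirm the pins after the fact, using standard git rather than anything this job runs:

    # A leading space per line means the working tree matches the committed
    # gitlink; a '+' would flag a submodule sitting on a different commit.
    git submodule status --recursive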
2025-05-07T20:22:55.9392190Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:55.9392502Z with:
2025-05-07T20:22:55.9392748Z name: fbgemm_genai_x86_clang_py3.9_cu12.8.0.whl
2025-05-07T20:22:55.9393072Z merge-multiple: false
2025-05-07T20:22:55.9393327Z repository: pytorch/FBGEMM
2025-05-07T20:22:55.9393586Z run-id: 14891846252
2025-05-07T20:22:55.9393790Z env:
2025-05-07T20:22:55.9394015Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:55.9394310Z BUILD_ENV: build_binary
2025-05-07T20:22:55.9394549Z BUILD_TARGET: genai
2025-05-07T20:22:55.9394761Z BUILD_VARIANT: cuda
2025-05-07T20:22:55.9394990Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:55.9395235Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:55.9395475Z ##[endgroup]
2025-05-07T20:22:56.1749831Z Downloading single artifact
2025-05-07T20:22:56.2606384Z Preparing to download the following artifacts:
2025-05-07T20:22:56.2607284Z - fbgemm_genai_x86_clang_py3.9_cu12.8.0.whl (ID: 3081405239, Size: 18501145, Expected Digest: sha256:49d17600359b05f780104ac5b5c7182a7fffa14a07ce833b6d20dd778f161f31)
2025-05-07T20:22:56.3137266Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-c9082d77-4e4e-5fd7-9873-085c291b0b68/artifacts/dcc1e5fc208536aec3c652f2daa6cf51fe28cc42d2657b9bbfc350fdc93bbce4.zip
2025-05-07T20:22:56.3138861Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:56.4301122Z (node:57020) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:56.4302162Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:56.7200718Z SHA256 digest of downloaded artifact is 49d17600359b05f780104ac5b5c7182a7fffa14a07ce833b6d20dd778f161f31
2025-05-07T20:22:56.7201380Z Artifact download completed successfully.
2025-05-07T20:22:56.7201720Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:56.7207024Z Download artifact has finished successfully
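download-artifact compares the downloaded zip against the digest the service advertised, which is why the two SHA256 lines above match. The same check by hand, assuming the artifact were saved as artifact.zip (a placeholder name; the digest is the one from this log):

    EXPECTED=49d17600359b05f780104ac5b5c7182a7fffa14a07ce833b6d20dd778f161f31
    # sha256sum --check expects "<hash>  <file>" with two spaces.
    echo "${EXPECTED}  artifact.zip" | sha256sum --check -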
2025-05-07T20:22:56.7532218Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:56.7532838Z with:
2025-05-07T20:22:56.7533159Z driver-version: 570.133.07
2025-05-07T20:22:56.7533554Z env:
2025-05-07T20:22:56.7533881Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.7534355Z BUILD_ENV: build_binary
2025-05-07T20:22:56.7534732Z BUILD_TARGET: genai
2025-05-07T20:22:56.7535073Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.7535432Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:56.7535835Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.7536202Z ##[endgroup]
2025-05-07T20:22:56.7638534Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:56.7638936Z with:
2025-05-07T20:22:56.7639337Z timeout_minutes: 10
2025-05-07T20:22:56.7639573Z max_attempts: 3
2025-05-07T20:22:56.7665177Z command: # Is it disgusting to have a full shell script here in this github action? Sure
  # But is it the best way to make it so that this action relies on nothing else? Absolutely
  set -eou pipefail

  DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
  DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

  install_nvidia_docker2_amzn2() {
    (
      set -x
      # Needed for yum-config-manager
      sudo yum install -y yum-utils
      if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
        YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
      else
        # Amazon Linux 2
        YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
      fi
      sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
      sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
      sudo systemctl restart docker
    )
  }

  install_nvidia_docker2_ubuntu20() {
    (
      set -x
      # Install nvidia-driver package if not installed
      status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
      if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
        sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      fi
    )
  }

  pre_install_nvidia_driver_amzn2() {
    (
      # Purge any nvidia driver installed from RHEL repo
      sudo yum remove -y nvidia-driver-latest-dkms
    )
  }

  install_nvidia_driver_common() {
    (
      # Try to gather more information about the runner and its existing NVIDIA driver if any
      echo "Before installing NVIDIA driver"
      lspci
      lsmod
      modinfo nvidia || true

      HAS_NVIDIA_DRIVER=0
      # Check if NVIDIA driver has already been installed
      if [ -x "$(command -v nvidia-smi)" ]; then
        set +e
        # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
        # so that the same driver version is not printed over multiple lines
        INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
        NVIDIA_SMI_STATUS=$?
        if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
          echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
        elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
          echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
          # Turn off persistent mode so that the installation script can unload the kernel module
          sudo killall nvidia-persistenced || true
        else
          HAS_NVIDIA_DRIVER=1
          echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
        fi
        set -e
      fi

      if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
        # CAUTION: this may need to be updated in future
        if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
          sudo yum groupinstall -y "Development Tools"
          # ensure our kernel install is the same as our underlying kernel,
          # groupinstall "Development Tools" has a habit of mismatching kernel headers
          sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
          sudo modprobe backlight
        fi
        sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

        set +e
        sudo /bin/bash /tmp/nvidia_driver -s --no-drm
        NVIDIA_INSTALLATION_STATUS=$?

        RESET_GPU=0
        if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
          sudo cat /var/log/nvidia-installer.log
          # Failed to install the NVIDIA driver, try to reset the GPU
          RESET_GPU=1
        elif [ -x "$(command -v nvidia-smi)" ]; then
          # Check again if nvidia-smi works even if the driver installation completes successfully
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            RESET_GPU=1
          fi
        fi

        if [ "$RESET_GPU" -eq 1 ]; then
          NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
          # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
          # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
          for PCI_ID in $NVIDIA_DEVICES; do
            DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
            echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
            # This requires sudo permission of course
            echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
            sleep 1
          done
        fi

        sudo rm -fv /tmp/nvidia_driver
        set -e
      fi
    )
  }

  post_install_nvidia_driver_common() {
    (
      sudo modprobe nvidia || true
      echo "After installing NVIDIA driver"
      lspci
      lsmod
      modinfo nvidia || true

      (
        set +e
        nvidia-smi
        # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
        # the case where the driver has already crashed as it still can get the driver version
        # and some basic information like the bus ID. However, the rest of the information
        # would be missing (ERR!), for example:
        #
        # +-----------------------------------------------------------------------------+
        # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
        # |-------------------------------+----------------------+----------------------+
        # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
        # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
        # |                               |                      |               MIG M. |
        # |===============================+======================+======================|
        # |   0  ERR!                 Off | 00000000:00:1E.0 Off |                 ERR! |
        # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |      ERR!    Default |
        # |                               |                      |                 ERR! |
        # +-------------------------------+----------------------+----------------------+
        #
        # +-----------------------------------------------------------------------------+
        # | Processes:                                                                  |
        # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
        # |        ID   ID                                                   Usage      |
        # |=============================================================================|
        # +-----------------------------------------------------------------------------+
        #
        # This should be reported as a failure instead as it will guarantee to fail when
        # Docker tries to run with --gpus all
        #
        # So, the correct check here is to query one of the missing pieces of info like
        # GPU name, so that the command can fail accordingly
        nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
        NVIDIA_SMI_STATUS=$?

        # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
        if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
          echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
        else
          echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
          exit ${NVIDIA_SMI_STATUS}
        fi
        set -e
      )
    )
  }

  install_nvidia_driver_amzn2() {
    (
      set -x
      pre_install_nvidia_driver_amzn2
      install_nvidia_driver_common
      post_install_nvidia_driver_common
    )
  }

  install_nvidia_driver_ubuntu20() {
    (
      set -x
      install_nvidia_driver_common
      post_install_nvidia_driver_common
    )
  }

  echo "== Installing nvidia driver ${DRIVER_FN} =="
  case "${DISTRIBUTION}" in
    amzn*)
      install_nvidia_driver_amzn2
      ;;
    ubuntu20.04)
      install_nvidia_driver_ubuntu20
      ;;
    *)
      echo "ERROR: Unknown distribution ${DISTRIBUTION}"
      exit 1
      ;;
  esac

  # Install container toolkit based on distribution
  echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
  case "${DISTRIBUTION}" in
    amzn*)
      install_nvidia_docker2_amzn2
      ;;
    ubuntu20.04)
      install_nvidia_docker2_ubuntu20
      ;;
    *)
      echo "ERROR: Unknown distribution ${DISTRIBUTION}"
      exit 1
      ;;
  esac

  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

  # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
  # more than one GPU. This just needs to be run once. The command fails
  # on subsequent runs and complains that the mode is already on, but that's
  # ok
  sudo nvidia-persistenced || true

  # This should show persistence mode ON
  nvidia-smi
2025-05-07T20:22:56.7691437Z retry_wait_seconds: 10
2025-05-07T20:22:56.7691702Z polling_interval_seconds: 1
2025-05-07T20:22:56.7691968Z warning_on_retry: true
2025-05-07T20:22:56.7692225Z continue_on_error: false
2025-05-07T20:22:56.7692476Z env:
2025-05-07T20:22:56.7692705Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.7693021Z BUILD_ENV: build_binary
2025-05-07T20:22:56.7693275Z BUILD_TARGET: genai
2025-05-07T20:22:56.7693509Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.7693761Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:56.7694030Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.7694281Z DRIVER_VERSION: 570.133.07
2025-05-07T20:22:56.7694541Z ##[endgroup]
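The heart of install_nvidia_driver_common is the version probe that decides whether installation can be skipped, which is exactly the branch this run takes below. A standalone sketch of that probe, with DRIVER_VERSION hard-coded here purely for illustration:

    DRIVER_VERSION=570.133.07
    if [ -x "$(command -v nvidia-smi)" ]; then
        # Query only GPU 0 so multi-GPU hosts report a single version string.
        INSTALLED=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
        STATUS=$?
        # Exit status 14 is tolerated, per https://github.com/NVIDIA/gpu-operator/issues/285
        if { [ "$STATUS" -eq 0 ] || [ "$STATUS" -eq 14 ]; } && [ "$INSTALLED" = "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED) already installed, skipping"
        fi
    fi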
2025-05-07T20:22:56.8514162Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:56.8514909Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:56.8518527Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:57.4361275Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:57.4361669Z No packages marked for removal.
2025-05-07T20:22:57.4423804Z Dependencies resolved.
2025-05-07T20:22:57.4437653Z Nothing to do.
2025-05-07T20:22:57.4437884Z Complete!
2025-05-07T20:22:57.4758564Z + install_nvidia_driver_common
2025-05-07T20:22:57.4763080Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:57.4763450Z + lspci
2025-05-07T20:22:57.4764105Z Before installing NVIDIA driver
2025-05-07T20:22:57.4965368Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:57.4966222Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:57.4966809Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:57.4967346Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:57.4968060Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:57.4968654Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:57.4969152Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:57.4969648Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:57.4970070Z + lsmod
2025-05-07T20:22:57.5011514Z Module                  Size  Used by
2025-05-07T20:22:57.5011809Z xt_conntrack           16384  1
2025-05-07T20:22:57.5012084Z nft_chain_nat          16384  3
2025-05-07T20:22:57.5012350Z xt_MASQUERADE          20480  1
2025-05-07T20:22:57.5012660Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:57.5013005Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:57.5013418Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:57.5013866Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:57.5014225Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:57.5014526Z xfrm_user              57344  1
2025-05-07T20:22:57.5014821Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:57.5015134Z xt_addrtype            16384  2
2025-05-07T20:22:57.5015397Z nft_compat             20480  4
2025-05-07T20:22:57.5015707Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:57.5016128Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:57.5016530Z br_netfilter           36864  0
2025-05-07T20:22:57.5016807Z bridge                323584  1 br_netfilter
2025-05-07T20:22:57.5017120Z stp                    16384  1 bridge
2025-05-07T20:22:57.5017411Z llc                    16384  2 bridge,stp
2025-05-07T20:22:57.5017703Z overlay               167936  0
2025-05-07T20:22:57.5017956Z tls                   135168  0
2025-05-07T20:22:57.5018199Z nls_ascii              16384  1
2025-05-07T20:22:57.5018453Z nls_cp437              20480  1
2025-05-07T20:22:57.5018704Z vfat                   24576  1
2025-05-07T20:22:57.5018956Z fat                    86016  1 vfat
2025-05-07T20:22:57.5019230Z sunrpc                696320  1
2025-05-07T20:22:57.5019480Z ena                   180224  0
2025-05-07T20:22:57.5019721Z i8042                  45056  0
2025-05-07T20:22:57.5019975Z serio                  28672  3 i8042
2025-05-07T20:22:57.5020256Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:57.5020523Z button                 24576  0
2025-05-07T20:22:57.5020779Z sch_fq_codel           20480  17
2025-05-07T20:22:57.5021039Z dm_mod                188416  0
2025-05-07T20:22:57.5021293Z loop                   36864  0
2025-05-07T20:22:57.5021539Z fuse                  163840  1
2025-05-07T20:22:57.5021790Z configfs               57344  1
2025-05-07T20:22:57.5022050Z dax                    45056  1 dm_mod
2025-05-07T20:22:57.5022322Z dmi_sysfs              20480  0
2025-05-07T20:22:57.5022580Z crc32_pclmul           16384  0
2025-05-07T20:22:57.5022838Z crc32c_intel           24576  0
2025-05-07T20:22:57.5023088Z efivarfs               24576  1
2025-05-07T20:22:57.5023345Z + modinfo nvidia
2025-05-07T20:22:57.5031240Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:57.5031739Z import_ns:      DMA_BUF
2025-05-07T20:22:57.5031984Z alias:          char-major-195-*
2025-05-07T20:22:57.5032263Z version:        570.133.07
2025-05-07T20:22:57.5032518Z supported:      external
2025-05-07T20:22:57.5032925Z license:        Dual MIT/GPL
2025-05-07T20:22:57.5033234Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:57.5033581Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:57.5034312Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:57.5034640Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:57.5034991Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:57.5035334Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:57.5035655Z depends:        i2c-core,drm
2025-05-07T20:22:57.5035908Z retpoline:      Y
2025-05-07T20:22:57.5036126Z name:           nvidia
2025-05-07T20:22:57.5036494Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:57.5036976Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:57.5037447Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:57.5038000Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:57.5038311Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:57.5038626Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:57.5038950Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:57.5039253Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:57.5039570Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:57.5039937Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:57.5040333Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:57.5040672Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:57.5040980Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:57.5041285Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:57.5041646Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:57.5042046Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:57.5042431Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:57.5042849Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.5043265Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:57.5043698Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.5044126Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:57.5044468Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:57.5044897Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:57.5045281Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:57.5045619Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:57.5045947Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:57.5046284Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:57.5046606Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:57.5046921Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:57.5047279Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:57.5047653Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:57.5047977Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:57.5048322Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:57.5048683Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:57.5049020Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:57.5049370Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:57.5049712Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:57.5050001Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:57.5050335Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:57.5050677Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:57.5050990Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:57.5051325Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:57.5051694Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:57.5052047Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:57.5052377Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:57.5052731Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:57.5053083Z parm:           rm_firmware_active:charp
2025-05-07T20:22:57.5053477Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:57.5053729Z ++ command -v nvidia-smi
2025-05-07T20:22:57.5053996Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:57.5054256Z + set +e
2025-05-07T20:22:57.5054576Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:22:59.3286367Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:22:59.3286745Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:59.3287061Z + '[' 0 -ne 0 ']'
2025-05-07T20:22:59.3287366Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:22:59.3287753Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:22:59.3288374Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:22:59.3289041Z + set -e
2025-05-07T20:22:59.3289787Z + '[' 1 -eq 0 ']'
2025-05-07T20:22:59.3290347Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:22:59.3290904Z + post_install_nvidia_driver_common
2025-05-07T20:22:59.3294668Z + sudo modprobe nvidia
2025-05-07T20:22:59.4778632Z + echo 'After installing NVIDIA driver'
2025-05-07T20:22:59.4778954Z + lspci
2025-05-07T20:22:59.4779427Z After installing NVIDIA driver
2025-05-07T20:22:59.4899513Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:59.4900031Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:59.4900607Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:59.4901155Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:59.4901647Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:59.4902198Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:59.4902719Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:59.4903209Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:59.4903634Z + lsmod
2025-05-07T20:22:59.4931408Z Module                  Size  Used by
2025-05-07T20:22:59.4931706Z nvidia_uvm           1884160  0
2025-05-07T20:22:59.4932147Z nvidia              11583488  1 nvidia_uvm
2025-05-07T20:22:59.4932478Z drm                   602112  1 nvidia
2025-05-07T20:22:59.4932792Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:22:59.4933115Z backlight              24576  1 drm
2025-05-07T20:22:59.4933409Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:22:59.4933711Z xt_conntrack           16384  1
2025-05-07T20:22:59.4933967Z nft_chain_nat          16384  3
2025-05-07T20:22:59.4934230Z xt_MASQUERADE          20480  1
2025-05-07T20:22:59.4934549Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:59.4934892Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:59.4935299Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:59.4935755Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:59.4936080Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:59.4936377Z xfrm_user              57344  1
2025-05-07T20:22:59.4936648Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:59.4936953Z xt_addrtype            16384  2
2025-05-07T20:22:59.4937207Z nft_compat             20480  4
2025-05-07T20:22:59.4937513Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:59.4937943Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:59.4938326Z br_netfilter           36864  0
2025-05-07T20:22:59.4938602Z bridge                323584  1 br_netfilter
2025-05-07T20:22:59.4938902Z stp                    16384  1 bridge
2025-05-07T20:22:59.4939184Z llc                    16384  2 bridge,stp
2025-05-07T20:22:59.4939478Z overlay               167936  0
2025-05-07T20:22:59.4939729Z tls                   135168  0
2025-05-07T20:22:59.4939984Z nls_ascii              16384  1
2025-05-07T20:22:59.4940523Z nls_cp437              20480  1
2025-05-07T20:22:59.4940777Z vfat                   24576  1
2025-05-07T20:22:59.4941036Z fat                    86016  1 vfat
2025-05-07T20:22:59.4941296Z sunrpc                696320  1
2025-05-07T20:22:59.4941546Z ena                   180224  0
2025-05-07T20:22:59.4941789Z i8042                  45056  0
2025-05-07T20:22:59.4942034Z serio                  28672  3 i8042
2025-05-07T20:22:59.4942315Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:59.4942570Z button                 24576  0
2025-05-07T20:22:59.4942818Z sch_fq_codel           20480  17
2025-05-07T20:22:59.4943077Z dm_mod                188416  0
2025-05-07T20:22:59.4943324Z loop                   36864  0
2025-05-07T20:22:59.4943561Z fuse                  163840  1
2025-05-07T20:22:59.4943942Z configfs               57344  1
2025-05-07T20:22:59.4944199Z dax                    45056  1 dm_mod
2025-05-07T20:22:59.4944475Z dmi_sysfs              20480  0
2025-05-07T20:22:59.4944721Z crc32_pclmul           16384  0
2025-05-07T20:22:59.4944980Z crc32c_intel           24576  0
2025-05-07T20:22:59.4945233Z efivarfs               24576  1
2025-05-07T20:22:59.4945474Z + modinfo nvidia
2025-05-07T20:22:59.4949536Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:59.4950159Z import_ns:      DMA_BUF
2025-05-07T20:22:59.4950401Z alias:          char-major-195-*
2025-05-07T20:22:59.4950660Z version:        570.133.07
2025-05-07T20:22:59.4950899Z supported:      external
2025-05-07T20:22:59.4951142Z license:        Dual MIT/GPL
2025-05-07T20:22:59.4951423Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:59.4951767Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:59.4952090Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:59.4952411Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:59.4952755Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:59.4953094Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:59.4953408Z depends:        i2c-core,drm
2025-05-07T20:22:59.4953664Z retpoline:      Y
2025-05-07T20:22:59.4953875Z name:           nvidia
2025-05-07T20:22:59.4954242Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:59.4954723Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:59.4955180Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:59.4955609Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:59.4955917Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:59.4956217Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:59.4956533Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:59.4956829Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:59.4957141Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:59.4957506Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:59.4957902Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:59.4958231Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:59.4958539Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:59.4958846Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:59.4959205Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:59.4959606Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:59.4959994Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:59.4960407Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.4960818Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:59.4961239Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.4961656Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:59.4961987Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:59.4962357Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:59.4962841Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:59.4963182Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:59.4963507Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:59.4963842Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:59.4964160Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:59.4964473Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:59.4964823Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:59.4965180Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:59.4965512Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:59.4965846Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:59.4966189Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:59.4966616Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:59.4966954Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:59.4967287Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:59.4967576Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:59.4967899Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:59.4968225Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:59.4968535Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:59.4968864Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:59.4969227Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:59.4969578Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:59.4969898Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:59.4970244Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:59.4970584Z parm:           rm_firmware_active:charp
2025-05-07T20:22:59.4970865Z + set +e
2025-05-07T20:22:59.4971060Z + nvidia-smi
2025-05-07T20:23:00.9061907Z Wed May 7 20:23:00 2025
2025-05-07T20:23:00.9062321Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.9062882Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:00.9063399Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.9063918Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:00.9064465Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:00.9064918Z |                                         |                        |               MIG M. |
2025-05-07T20:23:00.9065271Z |=========================================+========================+======================|
2025-05-07T20:23:00.9127124Z |   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:00.9127635Z |  0%   29C    P0             64W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:00.9128061Z |                                         |                        |                  N/A |
2025-05-07T20:23:00.9128485Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.9128903Z
2025-05-07T20:23:00.9129317Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.9129773Z | Processes:                                                                              |
2025-05-07T20:23:00.9130240Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:00.9130670Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:00.9131030Z |=========================================================================================|
2025-05-07T20:23:00.9132226Z |  No running processes found                                                             |
2025-05-07T20:23:00.9132985Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.3205590Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:02.7255230Z NVIDIA A10G
2025-05-07T20:23:02.9933279Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:02.9933541Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:02.9933791Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:02.9934090Z + set -e
2025-05-07T20:23:02.9934309Z INFO: Ignoring allowed status 0
2025-05-07T20:23:02.9942878Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:02.9947946Z + sudo yum install -y yum-utils
2025-05-07T20:23:03.4110418Z Last metadata expiration check: 0:07:09 ago on Wed May 7 20:15:54 2025.
2025-05-07T20:23:03.4359435Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:03.4756919Z Dependencies resolved.
2025-05-07T20:23:03.4938021Z Nothing to do.
2025-05-07T20:23:03.4938770Z Complete!
2025-05-07T20:23:03.5327224Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:03.5327822Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.5328731Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.8784717Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.9338723Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:04.5199855Z nvidia-container-toolkit                         14 kB/s | 833 B     00:00
2025-05-07T20:23:04.5448887Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:04.5846633Z Dependencies resolved.
2025-05-07T20:23:04.6025395Z ================================================================================
2025-05-07T20:23:04.6026319Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:04.6026762Z ================================================================================
2025-05-07T20:23:04.6027068Z Downgrading:
2025-05-07T20:23:04.6027442Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:04.6028059Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit 5.6 M
2025-05-07T20:23:04.6028432Z
2025-05-07T20:23:04.6028533Z Transaction Summary
2025-05-07T20:23:04.6028774Z ================================================================================
2025-05-07T20:23:04.6029092Z Downgrade  2 Packages
2025-05-07T20:23:04.6029240Z
2025-05-07T20:23:04.6029348Z Total download size: 6.8 M
2025-05-07T20:23:04.6030226Z Downloading Packages:
2025-05-07T20:23:04.6545148Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  24 MB/s | 1.2 MB     00:00
2025-05-07T20:23:04.6962189Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  61 MB/s | 5.6 MB     00:00
2025-05-07T20:23:04.6970725Z --------------------------------------------------------------------------------
2025-05-07T20:23:04.6973594Z Total                                            73 MB/s | 6.8 MB     00:00
2025-05-07T20:23:04.6976055Z Running transaction check
2025-05-07T20:23:04.7081708Z Transaction check succeeded.
2025-05-07T20:23:04.7082096Z Running transaction test
2025-05-07T20:23:04.7376602Z Transaction test succeeded.
2025-05-07T20:23:04.7378447Z Running transaction
2025-05-07T20:23:05.2907612Z   Preparing        :                                                       1/1
2025-05-07T20:23:05.3964480Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64         1/4
2025-05-07T20:23:05.3990124Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64              2/4
2025-05-07T20:23:05.4201828Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64              2/4
2025-05-07T20:23:05.4202536Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64              3/4
2025-05-07T20:23:05.4305739Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64              3/4
2025-05-07T20:23:05.4334365Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64         4/4
2025-05-07T20:23:06.8404870Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64              4/4
2025-05-07T20:23:06.8405816Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64              1/4
2025-05-07T20:23:06.8406683Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64              2/4
2025-05-07T20:23:06.8407568Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64         3/4
2025-05-07T20:23:06.9715291Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64         4/4================================================================================
2025-05-07T20:23:06.9716132Z WARNING:
2025-05-07T20:23:06.9716371Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:06.9716630Z
2025-05-07T20:23:06.9716730Z   Available Versions:
2025-05-07T20:23:06.9716901Z
2025-05-07T20:23:06.9717014Z   Version 2023.7.20250331:
2025-05-07T20:23:06.9717320Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:06.9717590Z
2025-05-07T20:23:06.9717718Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:06.9717940Z
2025-05-07T20:23:06.9718023Z     Release notes:
2025-05-07T20:23:06.9718444Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:06.9718832Z
2025-05-07T20:23:06.9718920Z   Version 2023.7.20250414:
2025-05-07T20:23:06.9719239Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:06.9719499Z
2025-05-07T20:23:06.9719625Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:06.9719842Z
2025-05-07T20:23:06.9719997Z     Release notes:
2025-05-07T20:23:06.9720658Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:06.9721074Z
2025-05-07T20:23:06.9721256Z   Version 2023.7.20250428:
2025-05-07T20:23:06.9721628Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:06.9721992Z
2025-05-07T20:23:06.9722151Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:06.9722395Z
2025-05-07T20:23:06.9722601Z     Release notes:
2025-05-07T20:23:06.9723084Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:06.9723547Z
2025-05-07T20:23:06.9734704Z ================================================================================
2025-05-07T20:23:07.0080969Z
2025-05-07T20:23:07.0081175Z
2025-05-07T20:23:07.0081312Z Downgraded:
2025-05-07T20:23:07.0081706Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:07.0082313Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:07.0082696Z
2025-05-07T20:23:07.0082908Z Complete!
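Note that requesting nvidia-container-toolkit-1.16.2 by name-version made dnf resolve a downgrade from the 1.17.6 already on the AMI, as the transaction above shows. Verifying the pin afterwards is a one-liner; the expected output is an inference from the transaction, not a line from this log:

    # Both should report 1.16.2-1 after the downgrade, not 1.17.6.
    rpm -q nvidia-container-toolkit nvidia-container-toolkit-base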
2025-05-07T20:23:07.0530792Z + sudo systemctl restart docker
2025-05-07T20:23:11.2009166Z Wed May 7 20:23:11 2025
2025-05-07T20:23:11.2009615Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.2010157Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:11.2010673Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.2011200Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:11.2011760Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:11.2012221Z |                                         |                        |               MIG M. |
2025-05-07T20:23:11.2012672Z |=========================================+========================+======================|
2025-05-07T20:23:11.2091597Z |   0  NVIDIA A10G                     On |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:11.2092421Z |  0%   30C    P0             64W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:11.2092838Z |                                         |                        |                  N/A |
2025-05-07T20:23:11.2093245Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.2093658Z
2025-05-07T20:23:11.2094061Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.2094502Z | Processes:                                                                              |
2025-05-07T20:23:11.2094959Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:11.2095532Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:11.2095888Z |=========================================================================================|
2025-05-07T20:23:11.2097193Z |  No running processes found                                                             |
2025-05-07T20:23:11.2097687Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.8255575Z Command completed after 1 attempt(s).
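The GPU_FLAG written to GITHUB_ENV by the script is what later steps splice into their docker invocations; note it now appears in the env block of the next step below. A hypothetical consumer, not part of this job, might look like:

    # GPU_FLAG expands to: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
    # Word-splitting is intended here, hence the unquoted expansion.
    docker run --rm ${GPU_FLAG} ubuntu:22.04 nvidia-smi

With the container toolkit installed, --gpus triggers the NVIDIA runtime hook, which is what the downgraded nvidia-container-toolkit packages above provide.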
2025-05-07T20:23:11.8340965Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.8341455Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.8356312Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:11.8356683Z env:
2025-05-07T20:23:11.8356913Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:11.8357213Z BUILD_ENV: build_binary
2025-05-07T20:23:11.8357459Z BUILD_TARGET: genai
2025-05-07T20:23:11.8357695Z BUILD_VARIANT: cuda
2025-05-07T20:23:11.8357945Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:11.8358229Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:11.8358534Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:11.8358868Z ##[endgroup]
2025-05-07T20:23:12.1703817Z ################################################################################
2025-05-07T20:23:12.1704170Z # Print System Info
2025-05-07T20:23:12.1704382Z #
2025-05-07T20:23:12.1720170Z # [2025-05-07T20:23:12.171Z] + print_system_info
2025-05-07T20:23:12.1720519Z ################################################################################
2025-05-07T20:23:12.1720769Z
2025-05-07T20:23:12.1720881Z ################################################################################
2025-05-07T20:23:12.1721217Z [INFO] Printing environment variables ...
2025-05-07T20:23:12.1721516Z + printenv
2025-05-07T20:23:12.1721627Z
2025-05-07T20:23:12.1743383Z SHELL=/bin/bash
2025-05-07T20:23:12.1743742Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:12.1744302Z BUILD_VARIANT=cuda
2025-05-07T20:23:12.1744940Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_c71ea148-c953-4dc1-a8bb-b70fcbecd39b
2025-05-07T20:23:12.1745653Z GITHUB_ACTION=__run
2025-05-07T20:23:12.1745949Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.1746299Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:12.1746545Z RUNNER_NAME=i-0efa96680de6b8d22
2025-05-07T20:23:12.1746834Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:12.1747148Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:12.1747409Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:12.1747787Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:12.1748241Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:12.1748519Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:12.1748807Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:12.1749350Z ***
2025-05-07T20:23:12.1749557Z LOGNAME=ec2-user
2025-05-07T20:23:12.1749790Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:12.1750205Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:12.1750437Z GITHUB_ACTIONS=true
2025-05-07T20:23:12.1750658Z SYSTEMD_EXEC_PID=55516
2025-05-07T20:23:12.1750934Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:12.1751503Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:12.1752035Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:12.1752311Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:12.1752575Z RUNNER_OS=Linux
2025-05-07T20:23:12.1752799Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:12.1753044Z HOME=/home/ec2-user
2025-05-07T20:23:12.1753290Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:12.1753587Z LANG=C.UTF-8
2025-05-07T20:23:12.1753881Z RUNNER_TRACKING_ID=github_f62a49ac-39e7-4f59-b3ca-31d00a76a701
2025-05-07T20:23:12.1754247Z RUNNER_ARCH=X64
2025-05-07T20:23:12.1754512Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:12.1755199Z BUILD_TARGET=genai
2025-05-07T20:23:12.1755753Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_c71ea148-c953-4dc1-a8bb-b70fcbecd39b
2025-05-07T20:23:12.1756678Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_c71ea148-c953-4dc1-a8bb-b70fcbecd39b
2025-05-07T20:23:12.1757452Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:12.1758156Z INVOCATION_ID=aab5aa5f0aac458c98e693b092c8fb0e
2025-05-07T20:23:12.1758491Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:12.1758749Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:12.1759361Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_c71ea148-c953-4dc1-a8bb-b70fcbecd39b
2025-05-07T20:23:12.1760058Z BUILD_ENV=build_binary
2025-05-07T20:23:12.1760288Z GITHUB_ACTOR=q10
2025-05-07T20:23:12.1760498Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:12.1760720Z KERN_NAME_LC=linux
2025-05-07T20:23:12.1760948Z BUILD_CUDA_VERSION=12.8.0
2025-05-07T20:23:12.1761240Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:12.1761586Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:12.1761838Z USER=ec2-user
2025-05-07T20:23:12.1762063Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:12.1762344Z SHLVL=1 2025-05-07T20:23:12.1762531Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:12.1762841Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:12.1763298Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:12.1763675Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:12.1763905Z KERN_NAME=Linux 2025-05-07T20:23:12.1764132Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:12.1764547Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:12.1764990Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:12.1765262Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:12.1765511Z JOURNAL_STREAM=8:82613 2025-05-07T20:23:12.1765834Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:12.1766203Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:12.1766513Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:12.1766853Z GITHUB_BASE_REF=main 2025-05-07T20:23:12.1767064Z CI=true 2025-05-07T20:23:12.1767266Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:12.1767549Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:12.1767826Z GITHUB_ACTION_REF= 2025-05-07T20:23:12.1768076Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:12.1768718Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_c71ea148-c953-4dc1-a8bb-b70fcbecd39b 2025-05-07T20:23:12.1769333Z MACHINE_NAME=x86_64 2025-05-07T20:23:12.1769554Z _=/usr/bin/printenv 2025-05-07T20:23:12.1769693Z 2025-05-07T20:23:12.1769809Z ################################################################################ 2025-05-07T20:23:12.1770135Z [INFO] Print ldd version ... 2025-05-07T20:23:12.1770388Z + ldd --version 2025-05-07T20:23:12.1770521Z 2025-05-07T20:23:12.1770621Z ldd (GNU libc) 2.34 2025-05-07T20:23:12.1770885Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:12.1771334Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:12.1771887Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:12.1772346Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:12.1772569Z 2025-05-07T20:23:12.1772689Z ################################################################################ 2025-05-07T20:23:12.1773000Z [INFO] Print CPU info ... 
2025-05-07T20:23:12.1773237Z + nproc 2025-05-07T20:23:12.1773343Z 2025-05-07T20:23:12.1789722Z 16 2025-05-07T20:23:12.1791662Z 2025-05-07T20:23:12.1791903Z + lscpu 2025-05-07T20:23:12.1792012Z 2025-05-07T20:23:12.1905591Z Architecture: x86_64 2025-05-07T20:23:12.1906097Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:12.1906790Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1907194Z Byte Order: Little Endian 2025-05-07T20:23:12.1907523Z CPU(s): 16 2025-05-07T20:23:12.1907820Z On-line CPU(s) list: 0-15 2025-05-07T20:23:12.1908148Z Vendor ID: AuthenticAMD 2025-05-07T20:23:12.1908496Z Model name: AMD EPYC 7R32 2025-05-07T20:23:12.1908809Z CPU family: 23 2025-05-07T20:23:12.1909253Z Model: 49 2025-05-07T20:23:12.1909551Z Thread(s) per core: 2 2025-05-07T20:23:12.1909836Z Core(s) per socket: 8 2025-05-07T20:23:12.1910287Z Socket(s): 1 2025-05-07T20:23:12.1910576Z Stepping: 0 2025-05-07T20:23:12.1910885Z BogoMIPS: 5600.00 2025-05-07T20:23:12.1913153Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1915442Z Hypervisor vendor: KVM 2025-05-07T20:23:12.1915751Z Virtualization type: full 2025-05-07T20:23:12.1916096Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:12.1916477Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:12.1916838Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:12.1917204Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:12.1917534Z NUMA node(s): 1 2025-05-07T20:23:12.1917832Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:12.1918167Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:12.1918550Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:12.1918943Z Vulnerability L1tf: Not affected 2025-05-07T20:23:12.1919300Z Vulnerability Mds: Not affected 2025-05-07T20:23:12.1919672Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:12.1920047Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:12.1920413Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:12.1920993Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:12.1921723Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:12.1922482Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:12.1923210Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:12.1924390Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:12.1925276Z Vulnerability Srbds: Not affected 2025-05-07T20:23:12.1925659Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:12.1925991Z 2025-05-07T20:23:12.1926084Z + cat /proc/cpuinfo 2025-05-07T20:23:12.1926220Z 2025-05-07T20:23:12.1926313Z processor : 0 2025-05-07T20:23:12.1926529Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1926772Z cpu family : 23 2025-05-07T20:23:12.1926984Z model : 49 
2025-05-07T20:23:12.1927189Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1927437Z stepping : 0 2025-05-07T20:23:12.1927652Z microcode : 0x830107f 2025-05-07T20:23:12.1927977Z cpu MHz : 3290.947 2025-05-07T20:23:12.1928190Z cache size : 512 KB 2025-05-07T20:23:12.1928402Z physical id : 0 2025-05-07T20:23:12.1928602Z siblings : 16 2025-05-07T20:23:12.1928801Z core id : 0 2025-05-07T20:23:12.1928995Z cpu cores : 8 2025-05-07T20:23:12.1929187Z apicid : 0 2025-05-07T20:23:12.1929388Z initial apicid : 0 2025-05-07T20:23:12.1929595Z fpu : yes 2025-05-07T20:23:12.1929790Z fpu_exception : yes 2025-05-07T20:23:12.1930008Z cpuid level : 13 2025-05-07T20:23:12.1930214Z wp : yes 2025-05-07T20:23:12.1932450Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1934907Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1935419Z bogomips : 5600.00 2025-05-07T20:23:12.1935647Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1935882Z clflush size : 64 2025-05-07T20:23:12.1936091Z cache_alignment : 64 2025-05-07T20:23:12.1936361Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1936695Z power management: 2025-05-07T20:23:12.1936826Z 2025-05-07T20:23:12.1936913Z processor : 1 2025-05-07T20:23:12.1937132Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1937371Z cpu family : 23 2025-05-07T20:23:12.1937569Z model : 49 2025-05-07T20:23:12.1937774Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1938023Z stepping : 0 2025-05-07T20:23:12.1938222Z microcode : 0x830107f 2025-05-07T20:23:12.1938451Z cpu MHz : 2829.392 2025-05-07T20:23:12.1938663Z cache size : 512 KB 2025-05-07T20:23:12.1938874Z physical id : 0 2025-05-07T20:23:12.1939082Z siblings : 16 2025-05-07T20:23:12.1939280Z core id : 1 2025-05-07T20:23:12.1939469Z cpu cores : 8 2025-05-07T20:23:12.1939667Z apicid : 2 2025-05-07T20:23:12.1939859Z initial apicid : 2 2025-05-07T20:23:12.1940073Z fpu : yes 2025-05-07T20:23:12.1940264Z fpu_exception : yes 2025-05-07T20:23:12.1940479Z cpuid level : 13 2025-05-07T20:23:12.1940684Z wp : yes 2025-05-07T20:23:12.1942820Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1945256Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1945770Z bogomips : 5600.00 2025-05-07T20:23:12.1945987Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1946216Z clflush size : 64 
2025-05-07T20:23:12.1946428Z cache_alignment : 64 2025-05-07T20:23:12.1946694Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1947012Z power management: 2025-05-07T20:23:12.1947148Z 2025-05-07T20:23:12.1947232Z processor : 2 2025-05-07T20:23:12.1947443Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1947680Z cpu family : 23 2025-05-07T20:23:12.1947876Z model : 49 2025-05-07T20:23:12.1948101Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1948363Z stepping : 0 2025-05-07T20:23:12.1948561Z microcode : 0x830107f 2025-05-07T20:23:12.1948784Z cpu MHz : 3303.588 2025-05-07T20:23:12.1948996Z cache size : 512 KB 2025-05-07T20:23:12.1949200Z physical id : 0 2025-05-07T20:23:12.1949405Z siblings : 16 2025-05-07T20:23:12.1949688Z core id : 2 2025-05-07T20:23:12.1949883Z cpu cores : 8 2025-05-07T20:23:12.1950223Z apicid : 4 2025-05-07T20:23:12.1950416Z initial apicid : 4 2025-05-07T20:23:12.1950626Z fpu : yes 2025-05-07T20:23:12.1950817Z fpu_exception : yes 2025-05-07T20:23:12.1951031Z cpuid level : 13 2025-05-07T20:23:12.1951237Z wp : yes 2025-05-07T20:23:12.1953483Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1955916Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1956426Z bogomips : 5600.00 2025-05-07T20:23:12.1956646Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1956886Z clflush size : 64 2025-05-07T20:23:12.1957094Z cache_alignment : 64 2025-05-07T20:23:12.1957379Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1957708Z power management: 2025-05-07T20:23:12.1957839Z 2025-05-07T20:23:12.1957920Z processor : 3 2025-05-07T20:23:12.1958136Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1958375Z cpu family : 23 2025-05-07T20:23:12.1958580Z model : 49 2025-05-07T20:23:12.1958783Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1959027Z stepping : 0 2025-05-07T20:23:12.1959233Z microcode : 0x830107f 2025-05-07T20:23:12.1959450Z cpu MHz : 3162.450 2025-05-07T20:23:12.1959663Z cache size : 512 KB 2025-05-07T20:23:12.1959880Z physical id : 0 2025-05-07T20:23:12.1960080Z siblings : 16 2025-05-07T20:23:12.1960277Z core id : 3 2025-05-07T20:23:12.1960479Z cpu cores : 8 2025-05-07T20:23:12.1960670Z apicid : 6 2025-05-07T20:23:12.1960866Z initial apicid : 6 2025-05-07T20:23:12.1961071Z fpu : yes 2025-05-07T20:23:12.1961261Z fpu_exception : yes 2025-05-07T20:23:12.1961477Z cpuid level : 13 2025-05-07T20:23:12.1961684Z wp : yes 2025-05-07T20:23:12.1963852Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1980178Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1980732Z bogomips : 5600.00 2025-05-07T20:23:12.1980960Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1981210Z clflush size : 64 2025-05-07T20:23:12.1981434Z cache_alignment : 64 2025-05-07T20:23:12.1981709Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1982042Z power management: 2025-05-07T20:23:12.1982176Z 2025-05-07T20:23:12.1982270Z processor : 4 2025-05-07T20:23:12.1982483Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1982728Z cpu family : 23 2025-05-07T20:23:12.1983253Z model : 49 2025-05-07T20:23:12.1983486Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1983876Z stepping : 0 2025-05-07T20:23:12.1984089Z microcode : 0x830107f 2025-05-07T20:23:12.1984314Z cpu MHz : 3304.110 2025-05-07T20:23:12.1984523Z cache size : 512 KB 2025-05-07T20:23:12.1984739Z physical id : 0 2025-05-07T20:23:12.1984949Z siblings : 16 2025-05-07T20:23:12.1985144Z core id : 4 2025-05-07T20:23:12.1985344Z cpu cores : 8 2025-05-07T20:23:12.1985545Z apicid : 8 2025-05-07T20:23:12.1985908Z initial apicid : 8 2025-05-07T20:23:12.1986119Z fpu : yes 2025-05-07T20:23:12.1986320Z fpu_exception : yes 2025-05-07T20:23:12.1986528Z cpuid level : 13 2025-05-07T20:23:12.1986738Z wp : yes 2025-05-07T20:23:12.1988994Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1991562Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1992072Z bogomips : 5600.00 2025-05-07T20:23:12.1992301Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1992543Z clflush size : 64 2025-05-07T20:23:12.1992756Z cache_alignment : 64 2025-05-07T20:23:12.1993026Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1993354Z power management: 2025-05-07T20:23:12.1993483Z 2025-05-07T20:23:12.1993577Z processor : 5 2025-05-07T20:23:12.1993781Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1994035Z cpu family : 23 2025-05-07T20:23:12.1994323Z model : 49 2025-05-07T20:23:12.1994582Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1994845Z stepping : 0 2025-05-07T20:23:12.1995060Z microcode : 0x830107f 2025-05-07T20:23:12.1995285Z cpu MHz : 3299.789 2025-05-07T20:23:12.1995505Z cache size : 512 KB 2025-05-07T20:23:12.1995726Z physical id : 0 2025-05-07T20:23:12.1995932Z siblings : 16 2025-05-07T20:23:12.1996147Z core id : 5 2025-05-07T20:23:12.1996426Z cpu cores : 8 2025-05-07T20:23:12.1996700Z apicid : 10 2025-05-07T20:23:12.1996995Z initial apicid : 10 2025-05-07T20:23:12.1997295Z fpu : yes 2025-05-07T20:23:12.1997582Z fpu_exception : yes 2025-05-07T20:23:12.1997888Z cpuid level : 13 2025-05-07T20:23:12.1998256Z wp : yes 2025-05-07T20:23:12.2001371Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2004200Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2004730Z bogomips : 5600.00 2025-05-07T20:23:12.2004967Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2005217Z clflush size : 64 2025-05-07T20:23:12.2005444Z cache_alignment : 64 2025-05-07T20:23:12.2005728Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2006069Z power management: 2025-05-07T20:23:12.2006209Z 2025-05-07T20:23:12.2006306Z processor : 6 2025-05-07T20:23:12.2006521Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2006773Z cpu family : 23 2025-05-07T20:23:12.2006991Z model : 49 2025-05-07T20:23:12.2007196Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2007434Z stepping : 0 2025-05-07T20:23:12.2007640Z microcode : 0x830107f 2025-05-07T20:23:12.2007870Z cpu MHz : 3164.899 2025-05-07T20:23:12.2008095Z cache size : 512 KB 2025-05-07T20:23:12.2008318Z physical id : 0 2025-05-07T20:23:12.2008527Z siblings : 16 2025-05-07T20:23:12.2008736Z core id : 6 2025-05-07T20:23:12.2008944Z cpu cores : 8 2025-05-07T20:23:12.2009150Z apicid : 12 2025-05-07T20:23:12.2009363Z initial apicid : 12 2025-05-07T20:23:12.2009585Z fpu : yes 2025-05-07T20:23:12.2009783Z fpu_exception : yes 2025-05-07T20:23:12.2010016Z cpuid level : 13 2025-05-07T20:23:12.2010356Z wp : yes 2025-05-07T20:23:12.2012581Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2015029Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2015542Z bogomips : 5600.00 2025-05-07T20:23:12.2015769Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2016012Z clflush size : 64 2025-05-07T20:23:12.2016224Z cache_alignment : 64 2025-05-07T20:23:12.2016499Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2016823Z power management: 2025-05-07T20:23:12.2016952Z 2025-05-07T20:23:12.2017032Z processor : 7 2025-05-07T20:23:12.2017248Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2017487Z cpu family : 23 2025-05-07T20:23:12.2017688Z model : 49 2025-05-07T20:23:12.2017908Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2018168Z stepping : 0 2025-05-07T20:23:12.2018378Z microcode : 0x830107f 2025-05-07T20:23:12.2018613Z cpu MHz : 3216.010 2025-05-07T20:23:12.2018847Z cache size : 512 KB 2025-05-07T20:23:12.2019060Z physical id : 0 2025-05-07T20:23:12.2019276Z siblings : 16 2025-05-07T20:23:12.2019482Z core id : 7 2025-05-07T20:23:12.2019682Z cpu cores : 8 2025-05-07T20:23:12.2019891Z apicid : 
14 2025-05-07T20:23:12.2020103Z initial apicid : 14 2025-05-07T20:23:12.2020321Z fpu : yes 2025-05-07T20:23:12.2020528Z fpu_exception : yes 2025-05-07T20:23:12.2020758Z cpuid level : 13 2025-05-07T20:23:12.2020969Z wp : yes 2025-05-07T20:23:12.2023116Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2025558Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2026069Z bogomips : 5600.00 2025-05-07T20:23:12.2026280Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2026514Z clflush size : 64 2025-05-07T20:23:12.2026726Z cache_alignment : 64 2025-05-07T20:23:12.2026989Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2027313Z power management: 2025-05-07T20:23:12.2027444Z 2025-05-07T20:23:12.2027528Z processor : 8 2025-05-07T20:23:12.2027737Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2027975Z cpu family : 23 2025-05-07T20:23:12.2028177Z model : 49 2025-05-07T20:23:12.2028379Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2028626Z stepping : 0 2025-05-07T20:23:12.2028840Z microcode : 0x830107f 2025-05-07T20:23:12.2029062Z cpu MHz : 3297.080 2025-05-07T20:23:12.2029282Z cache size : 512 KB 2025-05-07T20:23:12.2029500Z physical id : 0 2025-05-07T20:23:12.2029707Z siblings : 16 2025-05-07T20:23:12.2029903Z core id : 0 2025-05-07T20:23:12.2030254Z cpu cores : 8 2025-05-07T20:23:12.2030452Z apicid : 1 2025-05-07T20:23:12.2030653Z initial apicid : 1 2025-05-07T20:23:12.2030870Z fpu : yes 2025-05-07T20:23:12.2031064Z fpu_exception : yes 2025-05-07T20:23:12.2031288Z cpuid level : 13 2025-05-07T20:23:12.2031500Z wp : yes 2025-05-07T20:23:12.2033634Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2036281Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2036787Z bogomips : 5600.00 2025-05-07T20:23:12.2037002Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2037237Z clflush size : 64 2025-05-07T20:23:12.2037443Z cache_alignment : 64 2025-05-07T20:23:12.2037711Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2038032Z power management: 2025-05-07T20:23:12.2038162Z 2025-05-07T20:23:12.2038248Z processor : 9 2025-05-07T20:23:12.2038464Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2038708Z cpu family : 23 2025-05-07T20:23:12.2038914Z model : 49 2025-05-07T20:23:12.2039124Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2039374Z 
stepping : 0 2025-05-07T20:23:12.2039579Z microcode : 0x830107f 2025-05-07T20:23:12.2039809Z cpu MHz : 3208.471 2025-05-07T20:23:12.2040027Z cache size : 512 KB 2025-05-07T20:23:12.2040300Z physical id : 0 2025-05-07T20:23:12.2040513Z siblings : 16 2025-05-07T20:23:12.2040719Z core id : 1 2025-05-07T20:23:12.2040930Z cpu cores : 8 2025-05-07T20:23:12.2041125Z apicid : 3 2025-05-07T20:23:12.2041328Z initial apicid : 3 2025-05-07T20:23:12.2041545Z fpu : yes 2025-05-07T20:23:12.2041739Z fpu_exception : yes 2025-05-07T20:23:12.2041960Z cpuid level : 13 2025-05-07T20:23:12.2042175Z wp : yes 2025-05-07T20:23:12.2044319Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2046776Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2047296Z bogomips : 5600.00 2025-05-07T20:23:12.2047541Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2047874Z clflush size : 64 2025-05-07T20:23:12.2048177Z cache_alignment : 64 2025-05-07T20:23:12.2048559Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2048999Z power management: 2025-05-07T20:23:12.2049180Z 2025-05-07T20:23:12.2049278Z processor : 10 2025-05-07T20:23:12.2049498Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2049746Z cpu family : 23 2025-05-07T20:23:12.2049950Z model : 49 2025-05-07T20:23:12.2050160Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2050412Z stepping : 0 2025-05-07T20:23:12.2050615Z microcode : 0x830107f 2025-05-07T20:23:12.2050842Z cpu MHz : 3306.026 2025-05-07T20:23:12.2051060Z cache size : 512 KB 2025-05-07T20:23:12.2051277Z physical id : 0 2025-05-07T20:23:12.2051489Z siblings : 16 2025-05-07T20:23:12.2051695Z core id : 2 2025-05-07T20:23:12.2051890Z cpu cores : 8 2025-05-07T20:23:12.2052099Z apicid : 5 2025-05-07T20:23:12.2052384Z initial apicid : 5 2025-05-07T20:23:12.2052679Z fpu : yes 2025-05-07T20:23:12.2052946Z fpu_exception : yes 2025-05-07T20:23:12.2053239Z cpuid level : 13 2025-05-07T20:23:12.2053513Z wp : yes 2025-05-07T20:23:12.2055704Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2058709Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2059288Z bogomips : 5600.00 2025-05-07T20:23:12.2059607Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2059834Z clflush size : 64 2025-05-07T20:23:12.2060042Z cache_alignment : 64 2025-05-07T20:23:12.2060311Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:12.2060622Z power management: 2025-05-07T20:23:12.2060757Z 2025-05-07T20:23:12.2060838Z processor : 11 2025-05-07T20:23:12.2061052Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2061279Z cpu family : 23 2025-05-07T20:23:12.2061483Z model : 49 2025-05-07T20:23:12.2061696Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2061925Z stepping : 0 2025-05-07T20:23:12.2062120Z microcode : 0x830107f 2025-05-07T20:23:12.2062342Z cpu MHz : 3020.382 2025-05-07T20:23:12.2062544Z cache size : 512 KB 2025-05-07T20:23:12.2062753Z physical id : 0 2025-05-07T20:23:12.2063037Z siblings : 16 2025-05-07T20:23:12.2063301Z core id : 3 2025-05-07T20:23:12.2063568Z cpu cores : 8 2025-05-07T20:23:12.2063834Z apicid : 7 2025-05-07T20:23:12.2064072Z initial apicid : 7 2025-05-07T20:23:12.2064292Z fpu : yes 2025-05-07T20:23:12.2064492Z fpu_exception : yes 2025-05-07T20:23:12.2064735Z cpuid level : 13 2025-05-07T20:23:12.2065017Z wp : yes 2025-05-07T20:23:12.2067426Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2069873Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2070558Z bogomips : 5600.00 2025-05-07T20:23:12.2070766Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2071001Z clflush size : 64 2025-05-07T20:23:12.2071218Z cache_alignment : 64 2025-05-07T20:23:12.2071482Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2071804Z power management: 2025-05-07T20:23:12.2071933Z 2025-05-07T20:23:12.2072020Z processor : 12 2025-05-07T20:23:12.2072229Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2072466Z cpu family : 23 2025-05-07T20:23:12.2072668Z model : 49 2025-05-07T20:23:12.2072866Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2073105Z stepping : 0 2025-05-07T20:23:12.2073306Z microcode : 0x830107f 2025-05-07T20:23:12.2073523Z cpu MHz : 3310.035 2025-05-07T20:23:12.2073726Z cache size : 512 KB 2025-05-07T20:23:12.2073936Z physical id : 0 2025-05-07T20:23:12.2074135Z siblings : 16 2025-05-07T20:23:12.2074326Z core id : 4 2025-05-07T20:23:12.2074519Z cpu cores : 8 2025-05-07T20:23:12.2074720Z apicid : 9 2025-05-07T20:23:12.2074908Z initial apicid : 9 2025-05-07T20:23:12.2075171Z fpu : yes 2025-05-07T20:23:12.2075442Z fpu_exception : yes 2025-05-07T20:23:12.2075734Z cpuid level : 13 2025-05-07T20:23:12.2076014Z wp : yes 2025-05-07T20:23:12.2078553Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:12.2081121Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2081626Z bogomips : 5600.00 2025-05-07T20:23:12.2081841Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2082077Z clflush size : 64 2025-05-07T20:23:12.2082284Z cache_alignment : 64 2025-05-07T20:23:12.2082646Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2083223Z power management: 2025-05-07T20:23:12.2083356Z 2025-05-07T20:23:12.2083442Z processor : 13 2025-05-07T20:23:12.2083648Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2083884Z cpu family : 23 2025-05-07T20:23:12.2084088Z model : 49 2025-05-07T20:23:12.2084283Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2084526Z stepping : 0 2025-05-07T20:23:12.2084729Z microcode : 0x830107f 2025-05-07T20:23:12.2084949Z cpu MHz : 3299.717 2025-05-07T20:23:12.2085155Z cache size : 512 KB 2025-05-07T20:23:12.2085367Z physical id : 0 2025-05-07T20:23:12.2085563Z siblings : 16 2025-05-07T20:23:12.2085758Z core id : 5 2025-05-07T20:23:12.2085957Z cpu cores : 8 2025-05-07T20:23:12.2086145Z apicid : 11 2025-05-07T20:23:12.2086340Z initial apicid : 11 2025-05-07T20:23:12.2086544Z fpu : yes 2025-05-07T20:23:12.2086732Z fpu_exception : yes 2025-05-07T20:23:12.2086941Z cpuid level : 13 2025-05-07T20:23:12.2087141Z wp : yes 2025-05-07T20:23:12.2089622Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2092455Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2092969Z bogomips : 5600.00 2025-05-07T20:23:12.2093192Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2093432Z clflush size : 64 2025-05-07T20:23:12.2093644Z cache_alignment : 64 2025-05-07T20:23:12.2093918Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2094244Z power management: 2025-05-07T20:23:12.2094375Z 2025-05-07T20:23:12.2094461Z processor : 14 2025-05-07T20:23:12.2094677Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2094916Z cpu family : 23 2025-05-07T20:23:12.2095116Z model : 49 2025-05-07T20:23:12.2095322Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2095564Z stepping : 0 2025-05-07T20:23:12.2095767Z microcode : 0x830107f 2025-05-07T20:23:12.2095993Z cpu MHz : 3310.359 2025-05-07T20:23:12.2096214Z cache size : 512 KB 2025-05-07T20:23:12.2096427Z physical id : 0 2025-05-07T20:23:12.2096634Z siblings : 16 2025-05-07T20:23:12.2096831Z core id : 6 2025-05-07T20:23:12.2097028Z cpu cores : 8 2025-05-07T20:23:12.2097226Z apicid : 13 2025-05-07T20:23:12.2097432Z initial apicid : 13 2025-05-07T20:23:12.2097638Z fpu : yes 2025-05-07T20:23:12.2097841Z fpu_exception : yes 2025-05-07T20:23:12.2098057Z cpuid level : 13 2025-05-07T20:23:12.2098276Z wp : yes 2025-05-07T20:23:12.2100431Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2103677Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2104329Z bogomips : 5600.00 2025-05-07T20:23:12.2104547Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2104777Z clflush size : 64 2025-05-07T20:23:12.2104988Z cache_alignment : 64 2025-05-07T20:23:12.2105248Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2105562Z power management: 2025-05-07T20:23:12.2105691Z 2025-05-07T20:23:12.2105927Z processor : 15 2025-05-07T20:23:12.2106144Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2106379Z cpu family : 23 2025-05-07T20:23:12.2106574Z model : 49 2025-05-07T20:23:12.2106774Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2107017Z stepping : 0 2025-05-07T20:23:12.2107217Z microcode : 0x830107f 2025-05-07T20:23:12.2107439Z cpu MHz : 2912.833 2025-05-07T20:23:12.2107650Z cache size : 512 KB 2025-05-07T20:23:12.2107856Z physical id : 0 2025-05-07T20:23:12.2108069Z siblings : 16 2025-05-07T20:23:12.2108265Z core id : 7 2025-05-07T20:23:12.2108451Z cpu cores : 8 2025-05-07T20:23:12.2108650Z apicid : 15 2025-05-07T20:23:12.2108849Z initial apicid : 15 2025-05-07T20:23:12.2109058Z fpu : yes 2025-05-07T20:23:12.2109256Z fpu_exception : yes 2025-05-07T20:23:12.2109464Z cpuid level : 13 2025-05-07T20:23:12.2109662Z wp : yes 2025-05-07T20:23:12.2111917Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2114368Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2114878Z bogomips : 5600.00 2025-05-07T20:23:12.2115095Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2115326Z clflush size : 64 2025-05-07T20:23:12.2115540Z cache_alignment : 64 2025-05-07T20:23:12.2115809Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2116124Z power management: 2025-05-07T20:23:12.2116257Z 2025-05-07T20:23:12.2116262Z 2025-05-07T20:23:12.2116380Z ################################################################################ 2025-05-07T20:23:12.2116699Z [INFO] Print PCI info ... 2025-05-07T20:23:12.2116936Z + lspci -v 2025-05-07T20:23:12.2117054Z 2025-05-07T20:23:12.2117274Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:12.2117674Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:12.2118004Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:12.2118219Z 2025-05-07T20:23:12.2118429Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:12.2118842Z Physical Slot: 1 2025-05-07T20:23:12.2119181Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2119475Z 2025-05-07T20:23:12.2119836Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:12.2120422Z Physical Slot: 1 2025-05-07T20:23:12.2120683Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:12.2120922Z 2025-05-07T20:23:12.2121198Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:12.2121665Z Physical Slot: 3 2025-05-07T20:23:12.2121902Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2122249Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.2122612Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:12.2122843Z 2025-05-07T20:23:12.2123152Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.2123822Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:12.2124114Z Physical Slot: 4 2025-05-07T20:23:12.2124373Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:12.2124756Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.2125115Z Capabilities: 2025-05-07T20:23:12.2125387Z Kernel driver in use: nvme 2025-05-07T20:23:12.2125554Z 2025-05-07T20:23:12.2125858Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.2126353Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.2126708Z Physical Slot: 5 2025-05-07T20:23:12.2126942Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2127301Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.2127689Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.2128023Z Capabilities: 2025-05-07T20:23:12.2128290Z Kernel driver in use: ena 2025-05-07T20:23:12.2128532Z Kernel modules: ena 2025-05-07T20:23:12.2128670Z 2025-05-07T20:23:12.2128847Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:12.2129228Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:12.2129524Z Physical Slot: 30 2025-05-07T20:23:12.2129779Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:12.2130160Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:12.2130561Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:12.2130941Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:12.2131278Z Capabilities: 2025-05-07T20:23:12.2131539Z Kernel driver in use: nvidia 2025-05-07T20:23:12.2131796Z Kernel modules: nvidia 2025-05-07T20:23:12.2131941Z 2025-05-07T20:23:12.2132265Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.2140179Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:12.2140503Z Physical Slot: 31 2025-05-07T20:23:12.2140756Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2141134Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.2141536Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:12.2141873Z Capabilities: 2025-05-07T20:23:12.2142145Z Kernel driver in use: nvme 2025-05-07T20:23:12.2142312Z 2025-05-07T20:23:12.2142317Z 2025-05-07T20:23:12.2142447Z ################################################################################ 2025-05-07T20:23:12.2142777Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:12.2143071Z + uname -a 2025-05-07T20:23:12.2143192Z 2025-05-07T20:23:12.2143629Z Linux ip-10-0-51-101.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:12.2144164Z 2025-05-07T20:23:12.2144253Z + uname -m 2025-05-07T20:23:12.2144367Z 2025-05-07T20:23:12.2144441Z x86_64 2025-05-07T20:23:12.2144554Z 2025-05-07T20:23:12.2144637Z + cat /proc/version 2025-05-07T20:23:12.2144770Z 2025-05-07T20:23:12.2145350Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:12.2146022Z 2025-05-07T20:23:12.2146109Z + cat /etc/os-release 2025-05-07T20:23:12.2146251Z 2025-05-07T20:23:12.2146340Z NAME="Amazon Linux" 2025-05-07T20:23:12.2146557Z VERSION="2023" 2025-05-07T20:23:12.2146758Z ID="amzn" 2025-05-07T20:23:12.2146941Z ID_LIKE="fedora" 2025-05-07T20:23:12.2147145Z VERSION_ID="2023" 2025-05-07T20:23:12.2147370Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:12.2147646Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:12.2147936Z ANSI_COLOR="0;33" 2025-05-07T20:23:12.2148183Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:12.2148698Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:12.2149138Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:12.2149571Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:12.2150137Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:12.2150517Z VENDOR_NAME="AWS" 2025-05-07T20:23:12.2150758Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:12.2151051Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:12.2151205Z 2025-05-07T20:23:12.2151413Z ################################################################################ 2025-05-07T20:23:12.2151734Z # Print EC2 Instance Info 2025-05-07T20:23:12.2151965Z # 2025-05-07T20:23:12.2152177Z # [2025-05-07T20:23:12.211Z] + print_ec2_info 2025-05-07T20:23:12.2152492Z ################################################################################ 2025-05-07T20:23:12.2152718Z 2025-05-07T20:23:12.2241172Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:12.2363955Z instance-id: i-0efa96680de6b8d22 2025-05-07T20:23:12.2479409Z instance-type: g5.4xlarge 2025-05-07T20:23:12.2520759Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:12.2521128Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:12.2531264Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:12.2531630Z env: 2025-05-07T20:23:12.2531845Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:12.2532152Z BUILD_ENV: build_binary 2025-05-07T20:23:12.2532401Z BUILD_TARGET: genai 2025-05-07T20:23:12.2532632Z BUILD_VARIANT: cuda 2025-05-07T20:23:12.2532863Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:12.2533124Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:12.2533430Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:12.2533762Z ##[endgroup] 2025-05-07T20:23:12.5834332Z ################################################################################ 2025-05-07T20:23:12.5834833Z [INFO] Printing general display info ... 2025-05-07T20:23:12.5866822Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:12.6976475Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:12.6986642Z /usr/bin/sudo 2025-05-07T20:23:12.6996963Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:12.7007811Z /usr/bin/yum 2025-05-07T20:23:12.7009436Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:12.7029473Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:13.1537045Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:13.2258894Z ================================================================================ 2025-05-07T20:23:13.2259405Z WARNING: 2025-05-07T20:23:13.2259692Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:13.2259936Z 2025-05-07T20:23:13.2260030Z Available Versions: 2025-05-07T20:23:13.2260189Z 2025-05-07T20:23:13.2260279Z Version 2023.7.20250331: 2025-05-07T20:23:13.2260604Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:13.2260894Z 2025-05-07T20:23:13.2261026Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:13.2261245Z 2025-05-07T20:23:13.2261335Z Release notes: 2025-05-07T20:23:13.2261748Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:13.2262146Z 2025-05-07T20:23:13.2262232Z Version 2023.7.20250414: 2025-05-07T20:23:13.2262544Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:13.2262800Z 2025-05-07T20:23:13.2262917Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:13.2263131Z 2025-05-07T20:23:13.2263213Z Release notes: 2025-05-07T20:23:13.2263616Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:13.2263997Z 2025-05-07T20:23:13.2264088Z Version 2023.7.20250428: 2025-05-07T20:23:13.2264392Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:13.2264878Z 2025-05-07T20:23:13.2264988Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:13.2265216Z 2025-05-07T20:23:13.2265299Z Release notes: 2025-05-07T20:23:13.2265697Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:13.2266077Z 2025-05-07T20:23:13.2266188Z ================================================================================ 2025-05-07T20:23:13.3419238Z Dependencies resolved. 
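The "[EXEC] [ATTEMPT 0/3]" prefixes above mark commands run through a retry wrapper so that transient network or repository failures do not immediately fail the job. A minimal sketch of such a wrapper, assuming a fixed retry budget and delay; the name exec_with_retries is an assumption (the log only shows the wrapper's output format), and the real helper in setup_env.bash may differ:

exec_with_retries () {
  # Retry "$@" up to max_retries times, echoing each attempt in the log's format.
  local max_retries=3 attempt=0
  while true; do
    echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
    "$@" && return 0
    attempt=$((attempt + 1))
    if [ "${attempt}" -gt "${max_retries}" ]; then
      echo "[EXEC] Command failed after ${max_retries} retries: $*" >&2
      return 1
    fi
    sleep 5
  done
}

exec_with_retries sudo yum update -y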
2025-05-07T20:23:13.3703884Z ================================================================================ 2025-05-07T20:23:13.3704322Z Package Arch Version Repository Size 2025-05-07T20:23:13.3704743Z ================================================================================ 2025-05-07T20:23:13.3705049Z Upgrading: 2025-05-07T20:23:13.3705416Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:13.3706024Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:13.3706418Z 2025-05-07T20:23:13.3706799Z Transaction Summary 2025-05-07T20:23:13.3707059Z ================================================================================ 2025-05-07T20:23:13.3707372Z Upgrade 2 Packages 2025-05-07T20:23:13.3707513Z 2025-05-07T20:23:13.3707614Z Total download size: 6.9 M 2025-05-07T20:23:13.3708423Z Downloading Packages: 2025-05-07T20:23:13.4224423Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 25 MB/s | 1.2 MB 00:00 2025-05-07T20:23:13.4476703Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 75 MB/s | 5.7 MB 00:00 2025-05-07T20:23:13.4486129Z -------------------------------------------------------------------------------- 2025-05-07T20:23:13.4489022Z Total 89 MB/s | 6.9 MB 00:00 2025-05-07T20:23:13.4491520Z Running transaction check 2025-05-07T20:23:13.4586810Z Transaction check succeeded. 2025-05-07T20:23:13.4587283Z Running transaction test 2025-05-07T20:23:13.4881208Z Transaction test succeeded. 2025-05-07T20:23:13.4884174Z Running transaction 2025-05-07T20:23:14.0427946Z Preparing : 1/1 2025-05-07T20:23:14.1485830Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.1505919Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.1707320Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.1708129Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.1810031Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.1832226Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.3253198Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:14.3254397Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.3255558Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:14.3256666Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:14.5255534Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.5255911Z 2025-05-07T20:23:14.5256000Z Upgraded: 2025-05-07T20:23:14.5256367Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:14.5256966Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:14.5257338Z 2025-05-07T20:23:14.5257423Z Complete! 2025-05-07T20:23:14.5696091Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:14.5718270Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:15.0169786Z Last metadata expiration check: 0:00:11 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:15.0409991Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:15.0812665Z Dependencies resolved.
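The "[INSTALL]" steps above first probe for a package manager and then install through whichever is found; the earlier "which: no apt-get" line shows the probe order on this Amazon Linux runner. A sketch of that detection under the assumption that only apt-get and yum are supported; install_system_packages is a hypothetical name for illustration:

install_system_packages () {
  # Probe for apt-get first, then fall back to yum, as the log's probe order suggests.
  echo "[INSTALL] Installing system package(s): $* ..."
  if which apt-get > /dev/null 2>&1; then
    sudo apt-get update && sudo apt-get install -y "$@"
  elif which yum > /dev/null 2>&1; then
    sudo yum install -y "$@"
  else
    echo "[INSTALL] Neither apt-get nor yum was found" >&2
    return 1
  fi
}

install_system_packages hostname lshw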
2025-05-07T20:23:15.0990964Z ================================================================================ 2025-05-07T20:23:15.0991459Z Package Architecture Version Repository Size 2025-05-07T20:23:15.0991995Z ================================================================================ 2025-05-07T20:23:15.0992352Z Installing: 2025-05-07T20:23:15.0992647Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:15.0992927Z 2025-05-07T20:23:15.0993022Z Transaction Summary 2025-05-07T20:23:15.0993261Z ================================================================================ 2025-05-07T20:23:15.0993568Z Install 1 Package 2025-05-07T20:23:15.0993698Z 2025-05-07T20:23:15.0993818Z Total download size: 319 k 2025-05-07T20:23:15.0994474Z Installed size: 837 k 2025-05-07T20:23:15.0995684Z Downloading Packages: 2025-05-07T20:23:15.1870608Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 6.5 MB/s | 319 kB 00:00 2025-05-07T20:23:15.1876196Z -------------------------------------------------------------------------------- 2025-05-07T20:23:15.1879032Z Total 3.5 MB/s | 319 kB 00:00 2025-05-07T20:23:15.2035122Z Running transaction check 2025-05-07T20:23:15.2090508Z Transaction check succeeded. 2025-05-07T20:23:15.2090938Z Running transaction test 2025-05-07T20:23:15.2544255Z Transaction test succeeded. 2025-05-07T20:23:15.2547806Z Running transaction 2025-05-07T20:23:15.3544363Z Preparing : 1/1 2025-05-07T20:23:15.4032752Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.5561550Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.7115570Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.7115925Z 2025-05-07T20:23:15.7116013Z Installed: 2025-05-07T20:23:15.7116323Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:15.7116634Z 2025-05-07T20:23:15.7116713Z Complete! 2025-05-07T20:23:15.7558933Z + hostname 2025-05-07T20:23:15.7559138Z 2025-05-07T20:23:15.7572871Z ip-10-0-51-101.ec2.internal 2025-05-07T20:23:15.7574268Z 2025-05-07T20:23:15.7574683Z + sudo lshw -C display 2025-05-07T20:23:15.7574923Z 2025-05-07T20:23:16.2697258Z *-display:0 UNCLAIMED 2025-05-07T20:23:16.2697676Z description: VGA compatible controller 2025-05-07T20:23:16.2698011Z product: Amazon.com, Inc. 2025-05-07T20:23:16.2698282Z vendor: Amazon.com, Inc.
2025-05-07T20:23:16.2698544Z physical id: 3 2025-05-07T20:23:16.2698783Z bus info: pci@0000:00:03.0 2025-05-07T20:23:16.2699065Z version: 00 2025-05-07T20:23:16.2699297Z width: 32 bits 2025-05-07T20:23:16.2699526Z clock: 33MHz 2025-05-07T20:23:16.2699771Z capabilities: vga_controller bus_master 2025-05-07T20:23:16.2700081Z configuration: latency=0 2025-05-07T20:23:16.2700412Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:16.2700755Z *-display:1 2025-05-07T20:23:16.2700965Z description: 3D controller 2025-05-07T20:23:16.2701275Z product: GA102GL [A10G] 2025-05-07T20:23:16.2701540Z vendor: NVIDIA Corporation 2025-05-07T20:23:16.2701797Z physical id: 1e 2025-05-07T20:23:16.2702030Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:16.2702284Z version: a1 2025-05-07T20:23:16.2702483Z width: 64 bits 2025-05-07T20:23:16.2702702Z clock: 33MHz 2025-05-07T20:23:16.2702990Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:16.2703364Z configuration: driver=nvidia latency=0 2025-05-07T20:23:16.2704014Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:16.2736727Z 2025-05-07T20:23:16.2737087Z ################################################################################ 2025-05-07T20:23:16.2737447Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:16.2864342Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:16.3034182Z Wed May 7 20:23:16 2025 2025-05-07T20:23:16.3034650Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.3035181Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:16.3035690Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.3036206Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:16.3036756Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:16.3037206Z | | | MIG M. | 2025-05-07T20:23:16.3037549Z |=========================================+========================+======================| 2025-05-07T20:23:16.3116189Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:16.3118129Z | 0% 30C P0 60W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:16.3118567Z | | | N/A | 2025-05-07T20:23:16.3118974Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.3119383Z 2025-05-07T20:23:16.3119787Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.3120234Z | Processes: | 2025-05-07T20:23:16.3120688Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:16.3121116Z | ID ID Usage | 2025-05-07T20:23:16.3121478Z |=========================================================================================| 2025-05-07T20:23:16.3121913Z | No running processes found | 2025-05-07T20:23:16.3122402Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.4565360Z ################################################################################ 2025-05-07T20:23:16.4565732Z [INFO] Printing AMD GPU info ... 
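These GPU stanzas come from the print_gpu_info helper named in the step header: it prints NVIDIA state via lspci and nvidia-smi and then looks for the ROCm tools, and since this job exports ENFORCE_CUDA_DEVICE=1, a missing CUDA device is presumably treated as fatal. A sketch of the probe under those assumptions; the real logic in setup_env.bash may differ:

print_gpu_info () {
  echo "[INFO] Printing NVIDIA GPU info ..."
  if which nvidia-smi > /dev/null 2>&1; then
    nvidia-smi
  elif [ "${ENFORCE_CUDA_DEVICE:-0}" = "1" ]; then
    # Assumed behavior: fail the step when a CUDA device is required but absent.
    echo "[CHECK] nvidia-smi not found, but ENFORCE_CUDA_DEVICE=1" >&2
    return 1
  fi
  echo "[INFO] Printing AMD GPU info ..."
  for tool in rocminfo rocm-smi; do
    if which "${tool}" > /dev/null 2>&1; then
      "${tool}"
    else
      echo "[CHECK] ${tool} not found"
    fi
  done
}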
2025-05-07T20:23:16.4565360Z ################################################################################
2025-05-07T20:23:16.4565732Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:16.4707883Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:16.4708895Z [CHECK] rocminfo not found
2025-05-07T20:23:16.4718003Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:16.4719153Z [CHECK] rocm-smi not found
2025-05-07T20:23:16.4780691Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:16.4781145Z . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:16.4793171Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:16.4793533Z env:
2025-05-07T20:23:16.4793755Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:16.4794055Z BUILD_ENV: build_binary
2025-05-07T20:23:16.4794297Z BUILD_TARGET: genai
2025-05-07T20:23:16.4794527Z BUILD_VARIANT: cuda
2025-05-07T20:23:16.4794753Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:16.4795005Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:16.4795305Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:16.4795643Z ##[endgroup]
2025-05-07T20:23:16.8134557Z ################################################################################
2025-05-07T20:23:16.8134932Z # Setup Miniconda
2025-05-07T20:23:16.8135148Z #
2025-05-07T20:23:16.8149027Z # [2025-05-07T20:23:16.814Z] + setup_miniconda /home/ec2-user/miniconda
2025-05-07T20:23:16.8149446Z ################################################################################
2025-05-07T20:23:16.8164618Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:16.9037705Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:16.9038075Z + mkdir -p /home/ec2-user/miniconda
2025-05-07T20:23:16.9054565Z [SETUP] Downloading the Miniconda installer ...
2025-05-07T20:23:16.9075716Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
2025-05-07T20:23:17.9380649Z [SETUP] Installing Miniconda ...
2025-05-07T20:23:17.9381156Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u
2025-05-07T20:23:17.9525527Z PREFIX=/home/ec2-user/miniconda
2025-05-07T20:23:18.3962745Z Unpacking payload ...
2025-05-07T20:23:18.9148462Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:19.7165956Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:21.8167482Z Installing base environment...
2025-05-07T20:23:22.9022572Z Preparing transaction: ...working... done
2025-05-07T20:23:25.8392099Z Executing transaction: ...working... done
2025-05-07T20:23:26.4991976Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:26.5880972Z installation finished.
2025-05-07T20:23:26.5890857Z + rm -f miniconda.sh
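[NOTE] The installer steps above are reproducible outside of CI; a minimal sketch of the same non-interactive flow (prefix path assumed):
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$HOME/miniconda" -u   # -b: batch mode (accept license, no prompts); -p: install prefix; -u: update an existing install
    rm -f miniconda.sh                             # the installer is not needed after the prefix is populated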
2025-05-07T20:23:26.6201800Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:23:26.6202166Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:26.9838789Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:26.9839379Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:26.9839938Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:26.9840444Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:26.9840977Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:26.9841554Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:26.9842172Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:26.9842794Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:26.9843445Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:26.9844650Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:26.9845433Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:26.9845983Z modified /home/ec2-user/.bashrc
2025-05-07T20:23:26.9846582Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:27.0488406Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:27.8828139Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:27.8852812Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:41.3557682Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:42.9419421Z Solving environment: done
2025-05-07T20:23:43.0383886Z ## Package Plan ##
2025-05-07T20:23:43.0384208Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.0384574Z added / updated specs:
2025-05-07T20:23:43.0384846Z - conda-libmamba-solver
2025-05-07T20:23:43.0385098Z - libarchive
2025-05-07T20:23:43.0385310Z - libmamba
2025-05-07T20:23:43.0385514Z - libmambapy
2025-05-07T20:23:43.0385782Z The following packages will be downloaded:
2025-05-07T20:23:43.0386125Z package | build
2025-05-07T20:23:43.0386456Z ---------------------------|-----------------
2025-05-07T20:23:43.0386898Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:23:43.0387404Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:23:43.0387862Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:23:43.0388365Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:23:43.0388828Z ------------------------------------------------------------
2025-05-07T20:23:43.0389184Z Total: 1.4 MB
2025-05-07T20:23:43.0389517Z The following packages will be UPDATED:
2025-05-07T20:23:43.0395729Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.0396565Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.0397199Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.0397871Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.0398719Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.0399400Z Downloading and Extracting Packages: ...working... done
[download progress elided: conda-25.3.1, certifi-2025.4.26, ca-certificates-2025.4.26, and conda-libmamba-solver-25.4.0 all reached 100%]
2025-05-07T20:23:43.4192039Z Preparing transaction: done
2025-05-07T20:23:43.5194425Z Verifying transaction: done
2025-05-07T20:23:44.8212909Z Executing transaction: done
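[NOTE] Each [EXEC] [ATTEMPT n/3] line comes from a retry wrapper defined in setup_env.bash. The wrapper itself is not shown in this log; a minimal sketch of the pattern (the helper name and backoff interval are assumptions):
    run_with_retries () {
      local max=3
      for ((i = 0; i <= max; i++)); do
        echo "[EXEC] [ATTEMPT ${i}/${max}] + $*"
        "$@" && return 0   # success: stop retrying
        sleep 5            # assumed backoff between attempts
      done
      return 1             # all attempts failed
    }
    run_with_retries conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive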
2025-05-07T20:23:46.5878273Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:23:46.5908121Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:47.5123049Z Channels:
2025-05-07T20:23:47.5123331Z - defaults
2025-05-07T20:23:47.5123537Z Platform: linux-64
2025-05-07T20:23:48.7650537Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.3908901Z Solving environment: done
2025-05-07T20:23:49.5397356Z ## Package Plan ##
2025-05-07T20:23:49.5397682Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:49.5398030Z added / updated specs:
2025-05-07T20:23:49.5398294Z - conda
2025-05-07T20:23:49.5398543Z The following packages will be downloaded:
2025-05-07T20:23:49.5398880Z package | build
2025-05-07T20:23:49.5399212Z ---------------------------|-----------------
2025-05-07T20:23:49.5399576Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:23:49.5399977Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:23:49.5400357Z ------------------------------------------------------------
2025-05-07T20:23:49.5400703Z Total: 1.4 MB
2025-05-07T20:23:49.5401391Z The following packages will be UPDATED:
2025-05-07T20:23:49.5401925Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:49.5402456Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:49.5402874Z Downloading and Extracting Packages: ...working... done
[download progress elided: pip-25.1 and tzdata-2025b both reached 100%]
2025-05-07T20:23:49.9071101Z Preparing transaction: done
2025-05-07T20:23:50.0076825Z Verifying transaction: done
2025-05-07T20:23:52.3114297Z Executing transaction: done
2025-05-07T20:23:52.9227713Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:52.9232204Z + conda clean --packages --tarball -y
2025-05-07T20:23:53.9306751Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:53.9307099Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:53.9952990Z + conda clean --all -y
2025-05-07T20:23:54.5281322Z There are no unused tarball(s) to remove.
2025-05-07T20:23:54.5281721Z Will remove 1 index cache(s).
2025-05-07T20:23:54.5282007Z There are no unused package(s) to remove.
2025-05-07T20:23:54.5282330Z There are no tempfile(s) to remove.
2025-05-07T20:23:54.5282625Z There are no logfile(s) to remove.
2025-05-07T20:23:54.5921961Z + conda info
2025-05-07T20:23:55.3595035Z active environment : base
2025-05-07T20:23:55.3595507Z active env location : /home/ec2-user/miniconda
2025-05-07T20:23:55.3595934Z shell level : 1
2025-05-07T20:23:55.3596316Z user config file : /home/ec2-user/.condarc
2025-05-07T20:23:55.3596797Z populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:23:55.3597188Z conda version : 25.3.1
2025-05-07T20:23:55.3597504Z conda-build version : not installed
2025-05-07T20:23:55.3597806Z python version : 3.13.2.final.0
2025-05-07T20:23:55.3598118Z solver : libmamba (default)
2025-05-07T20:23:55.3598449Z virtual packages : __archspec=1=zen2
2025-05-07T20:23:55.3598757Z __conda=25.3.1=0
2025-05-07T20:23:55.3599041Z __cuda=12.8=0
2025-05-07T20:23:55.3599325Z __glibc=2.34=0
2025-05-07T20:23:55.3599611Z __linux=6.1.130=0
2025-05-07T20:23:55.3599887Z __unix=0=0
2025-05-07T20:23:55.3600233Z base environment : /home/ec2-user/miniconda (writable)
2025-05-07T20:23:55.3600658Z conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:23:55.3601020Z conda av metadata url : None
2025-05-07T20:23:55.3601397Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:23:55.3602994Z https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:23:55.3603400Z https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:23:55.3603780Z https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:23:55.3604160Z package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:23:55.3604506Z /home/ec2-user/.conda/pkgs
2025-05-07T20:23:55.3604846Z envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:23:55.3605191Z /home/ec2-user/.conda/envs
2025-05-07T20:23:55.3605498Z platform : linux-64
2025-05-07T20:23:55.3606385Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:23:55.3607264Z UID:GID : 1000:1000
2025-05-07T20:23:55.3607537Z netrc file : None
2025-05-07T20:23:55.3607806Z offline mode : False
2025-05-07T20:23:55.4258967Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:23:55.4260096Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_53de9a79-c4b6-4b66-9cfe-ac216a3e2536 ...
2025-05-07T20:23:55.4261578Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
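[NOTE] The add_path_* runner file command above is how setup_miniconda publishes the conda binaries to later steps: paths appended to the file named by $GITHUB_PATH are prepended to PATH for all subsequent steps of the job. A sketch of the equivalent one-liner (the exact contents the script writes are assumed):
    echo "$HOME/miniconda/bin" >> "$GITHUB_PATH"   # takes effect in later steps, not in the current one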
2025-05-07T20:23:55.4338043Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.9
2025-05-07T20:23:55.4338543Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.9
2025-05-07T20:23:55.4357745Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:55.4358103Z env:
2025-05-07T20:23:55.4358319Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:55.4358625Z BUILD_ENV: build_binary
2025-05-07T20:23:55.4358869Z BUILD_TARGET: genai
2025-05-07T20:23:55.4359094Z BUILD_VARIANT: cuda
2025-05-07T20:23:55.4359318Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:55.4359571Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:55.4359869Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:55.4360197Z ##[endgroup]
2025-05-07T20:23:55.7692407Z ################################################################################
2025-05-07T20:23:55.7692791Z # Create Conda Environment
2025-05-07T20:23:55.7693036Z #
2025-05-07T20:23:55.7709504Z # [2025-05-07T20:23:55.770Z] + create_conda_environment build_binary 3.9
2025-05-07T20:23:55.7710047Z ################################################################################
2025-05-07T20:23:55.7727882Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:55.8622947Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:55.8623760Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:23:55.8624408Z + conda info --envs
2025-05-07T20:23:56.6079728Z # conda environments:
2025-05-07T20:23:56.6079994Z #
2025-05-07T20:23:56.6080222Z base /home/ec2-user/miniconda
2025-05-07T20:23:56.6735813Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:23:58.3090274Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:23:58.3112814Z [SETUP] Creating new Conda environment (Python 3.9) ...
2025-05-07T20:23:58.3135521Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.9
2025-05-07T20:23:59.0633712Z Channels:
2025-05-07T20:23:59.0634057Z - defaults
2025-05-07T20:23:59.0634336Z Platform: linux-64
2025-05-07T20:24:00.6098158Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:00.7101938Z Solving environment: done
2025-05-07T20:24:00.7387946Z ## Package Plan ##
2025-05-07T20:24:00.7388581Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:00.7389145Z added / updated specs:
2025-05-07T20:24:00.7389502Z - python=3.9
2025-05-07T20:24:00.7390060Z The following packages will be downloaded:
2025-05-07T20:24:00.7390510Z package | build
2025-05-07T20:24:00.7390855Z ---------------------------|-----------------
2025-05-07T20:24:00.7391228Z _libgcc_mutex-0.1 | main 3 KB
2025-05-07T20:24:00.7391643Z _openmp_mutex-5.1 | 1_gnu 21 KB
2025-05-07T20:24:00.7392071Z ca-certificates-2025.2.25 | h06a4308_0 129 KB
2025-05-07T20:24:00.7392503Z python-3.9.21 | he870216_1 25.1 MB
2025-05-07T20:24:00.7392920Z setuptools-78.1.1 | py39h06a4308_0 1.7 MB
2025-05-07T20:24:00.7393330Z wheel-0.45.1 | py39h06a4308_0 114 KB
2025-05-07T20:24:00.7393704Z ------------------------------------------------------------
2025-05-07T20:24:00.7394049Z Total: 27.1 MB
2025-05-07T20:24:00.7394807Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:00.7395438Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:00.7395897Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:00.7396426Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:00.7396986Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:00.7397457Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:00.7397900Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:00.7398352Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:00.7398833Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:00.7399295Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:00.7399732Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:00.7400154Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:00.7400570Z python pkgs/main/linux-64::python-3.9.21-he870216_1
2025-05-07T20:24:00.7401003Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:00.7401493Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py39h06a4308_0
2025-05-07T20:24:00.7401975Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:00.7402380Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:00.7402770Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:00.7403199Z wheel pkgs/main/linux-64::wheel-0.45.1-py39h06a4308_0
2025-05-07T20:24:00.7403606Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:00.7403983Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:00.7404404Z Downloading and Extracting Packages: ...working... done
[download progress elided: python-3.9.21 (25.1 MB), setuptools-78.1.1, wheel-0.45.1, ca-certificates-2025.2.25, _openmp_mutex-5.1, and _libgcc_mutex-0.1 all reached 100%]
2025-05-07T20:24:02.0309166Z Preparing transaction: done
2025-05-07T20:24:03.1643382Z Verifying transaction: done
2025-05-07T20:24:05.3827612Z Executing transaction: done
2025-05-07T20:24:05.4331389Z #
2025-05-07T20:24:05.4331988Z # To activate this environment, use
2025-05-07T20:24:05.4332762Z #
2025-05-07T20:24:05.4333298Z # $ conda activate build_binary
2025-05-07T20:24:05.4334008Z #
2025-05-07T20:24:05.4334421Z # To deactivate an active environment, use
2025-05-07T20:24:05.4334993Z #
2025-05-07T20:24:05.4335348Z # $ conda deactivate
2025-05-07T20:24:05.5413569Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:05.5435415Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:08.3799329Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (25.1)
2025-05-07T20:24:08.3800251Z Collecting pip
2025-05-07T20:24:08.3801206Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:08.3801827Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:08.3802971Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 115.0 MB/s eta 0:00:00
2025-05-07T20:24:08.3803494Z Installing collected packages: pip
2025-05-07T20:24:08.3803938Z Attempting uninstall: pip
2025-05-07T20:24:08.3804336Z Found existing installation: pip 25.1
2025-05-07T20:24:08.3804792Z Uninstalling pip-25.1:
2025-05-07T20:24:08.3805186Z Successfully uninstalled pip-25.1
2025-05-07T20:24:08.3805633Z Successfully installed pip-25.1.1
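[NOTE] conda run -n build_binary executes a single command inside the environment without activating it in the calling shell, which is why the prelude can upgrade pip before any conda activate has happened. The same pattern works for arbitrary in-env commands (a sketch):
    conda run -n build_binary pip install --upgrade pip
    conda run -n build_binary python -c 'import sys; print(sys.executable)'   # confirms the env's interpreter is used, not the base one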
2025-05-07T20:24:08.4438367Z [SETUP] Upgrading pyOpenSSL ...
2025-05-07T20:24:08.4460792Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:09.2981726Z Channels:
2025-05-07T20:24:09.2982167Z - conda-forge
2025-05-07T20:24:09.2982589Z Platform: linux-64
2025-05-07T20:24:19.7144754Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:21.2218952Z Solving environment: done
2025-05-07T20:24:21.2817809Z ## Package Plan ##
2025-05-07T20:24:21.2818342Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:21.2818859Z added / updated specs:
2025-05-07T20:24:21.2819142Z - pyopenssl[version='>22.1.0']
2025-05-07T20:24:21.2819475Z The following packages will be downloaded:
2025-05-07T20:24:21.2819825Z package | build
2025-05-07T20:24:21.2820183Z ---------------------------|-----------------
2025-05-07T20:24:21.2820582Z cffi-1.17.1 | py39h15c3d72_0 236 KB conda-forge
2025-05-07T20:24:21.2821075Z cryptography-44.0.3 | py39h7170ec2_0 1.5 MB conda-forge
2025-05-07T20:24:21.2821550Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:24:21.2821983Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:24:21.2822425Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:24:21.2822864Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:24:21.2823311Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:24:21.2823772Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:24:21.2824227Z python_abi-3.9 | 2_cp39 4 KB conda-forge
2025-05-07T20:24:21.2824714Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:24:21.2825229Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:24:21.2825684Z ------------------------------------------------------------
2025-05-07T20:24:21.2826047Z Total: 6.3 MB
2025-05-07T20:24:21.2826407Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:21.2826838Z cffi conda-forge/linux-64::cffi-1.17.1-py39h15c3d72_0
2025-05-07T20:24:21.2827359Z cryptography conda-forge/linux-64::cryptography-44.0.3-py39h7170ec2_0
2025-05-07T20:24:21.2827893Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:21.2828371Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:21.2828871Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:21.2829707Z python_abi conda-forge/linux-64::python_abi-3.9-2_cp39
2025-05-07T20:24:21.2830401Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:21.2831187Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:21.2831676Z The following packages will be UPDATED:
2025-05-07T20:24:21.2832484Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:21.2833297Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:21.2833987Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:21.2834661Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:21.2835268Z Downloading and Extracting Packages: ...working... done
[download progress elided: openssl-3.5.0 (3.0 MB), cryptography-44.0.3, libgcc-15.1.0, libgomp-15.1.0, cffi-1.17.1, pyopenssl-25.0.0, pycparser-2.22, typing-extensions-4.13.2, typing_extensions-4.13.2, libgcc-ng-15.1.0, and python_abi-3.9 all reached 100%]
2025-05-07T20:24:21.8759293Z Preparing transaction: done
2025-05-07T20:24:21.9763615Z Verifying transaction: done
2025-05-07T20:24:23.4791571Z Executing transaction: done
2025-05-07T20:24:23.6557607Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:25.3783443Z [CHECK] Python (sub-)package 'OpenSSL' found ...
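[NOTE] Two details of the pyOpenSSL step are easy to trip over when reproducing it. The version spec has to be quoted in a shell, since an unquoted > is parsed as output redirection; and the install is verified by importing the package, whose import name is OpenSSL rather than pyOpenSSL. A sketch of both:
    conda install -n build_binary -c conda-forge --override-channels -y "pyOpenSSL>22.1.0"
    conda run -n build_binary python -c 'import OpenSSL; print(OpenSSL.__version__)'   # import name differs from the package name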
2025-05-07T20:24:25.3797067Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:25.3820045Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:26.2399315Z Channels:
2025-05-07T20:24:26.2399580Z - conda-forge
2025-05-07T20:24:26.2399811Z Platform: linux-64
2025-05-07T20:24:29.6227038Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:29.9947480Z Solving environment: done
2025-05-07T20:24:30.0549568Z ## Package Plan ##
2025-05-07T20:24:30.0550104Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:30.0550598Z added / updated specs:
2025-05-07T20:24:30.0550844Z - libxcrypt
2025-05-07T20:24:30.0551127Z The following packages will be downloaded:
2025-05-07T20:24:30.0551469Z package | build
2025-05-07T20:24:30.0551800Z ---------------------------|-----------------
2025-05-07T20:24:30.0552184Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:24:30.0552601Z ------------------------------------------------------------
2025-05-07T20:24:30.0552947Z Total: 98 KB
2025-05-07T20:24:30.0553286Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:30.0553738Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:30.0554193Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:30.4248409Z Preparing transaction: done
2025-05-07T20:24:30.5252651Z Verifying transaction: done
2025-05-07T20:24:30.6258305Z Executing transaction: done
2025-05-07T20:24:34.0643844Z [SETUP] Copying over ...
2025-05-07T20:24:34.0644589Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.9/crypt.h
2025-05-07T20:24:35.7024699Z [SETUP] Installed Python version: Python 3.9.21
2025-05-07T20:24:35.7025189Z [SETUP] Successfully created Conda environment: build_binary
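[NOTE] The crypt.h copy above works around the fact that Python 3.9 headers still expect crypt.h, which recent glibc builds no longer ship; installing libxcrypt and copying its header into the env's Python include directory satisfies the include. A minimal sketch of the same fix (PREFIX is a placeholder for the env prefix):
    PREFIX="$HOME/miniconda/envs/build_binary"
    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    cp "$PREFIX/include/crypt.h" "$PREFIX/include/python3.9/crypt.h"   # make crypt.h visible to Python C-extension builds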
2025-05-07T20:24:35.7057896Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:35.7058362Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:35.7070602Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:35.7070956Z env:
2025-05-07T20:24:35.7071185Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:35.7071515Z BUILD_ENV: build_binary
2025-05-07T20:24:35.7071777Z BUILD_TARGET: genai
2025-05-07T20:24:35.7072021Z BUILD_VARIANT: cuda
2025-05-07T20:24:35.7072267Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:24:35.7072545Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:35.7072870Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:35.7073375Z ##[endgroup]
2025-05-07T20:24:36.0452921Z ################################################################################
2025-05-07T20:24:36.0453455Z # Install C/C++ Compilers
2025-05-07T20:24:36.0453799Z #
2025-05-07T20:24:36.0469763Z # [2025-05-07T20:24:36.046Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:24:36.0470491Z ################################################################################
2025-05-07T20:24:36.0485769Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:36.1409318Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:36.1420425Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:24:36.1443867Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:37.0077667Z Channels:
2025-05-07T20:24:37.0078285Z - conda-forge
2025-05-07T20:24:37.0078631Z Platform: linux-64
2025-05-07T20:24:40.3344956Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:40.7047161Z Solving environment: done
2025-05-07T20:24:40.7659150Z ## Package Plan ##
2025-05-07T20:24:40.7659620Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:40.7660042Z added / updated specs:
2025-05-07T20:24:40.7660313Z - sysroot_linux-64=2.17
2025-05-07T20:24:40.7660614Z The following packages will be downloaded:
2025-05-07T20:24:40.7660957Z package | build
2025-05-07T20:24:40.7661280Z ---------------------------|-----------------
2025-05-07T20:24:40.7661719Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge
2025-05-07T20:24:40.7662228Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge
2025-05-07T20:24:40.7662661Z ------------------------------------------------------------
2025-05-07T20:24:40.7663016Z Total: 15.4 MB
2025-05-07T20:24:40.7663367Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:40.7663902Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:40.7664491Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:40.7664971Z Downloading and Extracting Packages: ...working... done
[download progress elided: sysroot_linux-64-2.17 (14.5 MB) and kernel-headers_linux-64-3.10.0 both reached 100%]
2025-05-07T20:24:41.7443023Z Preparing transaction: done
2025-05-07T20:24:41.9448695Z Verifying transaction: done
2025-05-07T20:24:42.1490803Z Executing transaction: done
2025-05-07T20:24:42.3014381Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:24:42.3014805Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:24:43.9875045Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
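[NOTE] Pinning sysroot_linux-64=2.17 makes the toolchain link against glibc 2.17 symbols, the same floor manylinux2014 uses, so the resulting binaries stay loadable on older distributions. One way to spot-check an artifact's glibc requirement after a build (the binary path is a placeholder):
    objdump -T build/libexample.so | grep -o 'GLIBC_[0-9.]*' | sort -uV | tail -1   # highest glibc symbol version referenced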
2025-05-07T20:24:43.9888860Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:24:43.9912288Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:24:44.8777018Z Channels:
2025-05-07T20:24:44.8777293Z - conda-forge
2025-05-07T20:24:44.8777535Z Platform: linux-64
2025-05-07T20:24:48.1994689Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:49.1634101Z Solving environment: done
2025-05-07T20:24:49.2271885Z ## Package Plan ##
2025-05-07T20:24:49.2272267Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:49.2272900Z added / updated specs:
2025-05-07T20:24:49.2273223Z - gxx_linux-64=11.4.0
2025-05-07T20:24:49.2273600Z The following packages will be downloaded:
2025-05-07T20:24:49.2274043Z package | build
2025-05-07T20:24:49.2274386Z ---------------------------|-----------------
2025-05-07T20:24:49.2274910Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge
2025-05-07T20:24:49.2275572Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge
2025-05-07T20:24:49.2276242Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge
2025-05-07T20:24:49.2276775Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge
2025-05-07T20:24:49.2277247Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge
2025-05-07T20:24:49.2277716Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge
2025-05-07T20:24:49.2278191Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge
2025-05-07T20:24:49.2278685Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge
2025-05-07T20:24:49.2279199Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge
2025-05-07T20:24:49.2279673Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge
2025-05-07T20:24:49.2280178Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge
2025-05-07T20:24:49.2280803Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge
2025-05-07T20:24:49.2281391Z ------------------------------------------------------------
2025-05-07T20:24:49.2281882Z Total: 91.6 MB
2025-05-07T20:24:49.2282230Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:49.2283103Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:24:49.2283930Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:24:49.2284880Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:24:49.2285419Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:24:49.2285948Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:24:49.2286480Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:24:49.2287037Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:49.2287621Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:24:49.2288143Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:24:49.2288717Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:49.2289365Z The following packages will be UPDATED:
2025-05-07T20:24:49.2289914Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
conda-forge::ld_impl_linux-64-2.40-hf3520f5_7 2025-05-07T20:24:49.2290831Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2 2025-05-07T20:24:49.2291421Z 2025-05-07T20:24:49.2291439Z 2025-05-07T20:24:49.2291444Z 2025-05-07T20:24:49.2291617Z Downloading and Extracting Packages: ...working... 2025-05-07T20:24:49.2292023Z gcc_impl_linux-64-11 | 53.0 MB | | 0% 2025-05-07T20:24:49.2292261Z 2025-05-07T20:24:49.2292520Z gxx_impl_linux-64-11 | 11.2 MB | | 0%  2025-05-07T20:24:49.2292774Z 2025-05-07T20:24:49.2292778Z 2025-05-07T20:24:49.2293003Z libstdcxx-devel_linu | 11.1 MB | | 0%  2025-05-07T20:24:49.2293279Z 2025-05-07T20:24:49.2293290Z 2025-05-07T20:24:49.2293294Z 2025-05-07T20:24:49.2310440Z binutils_impl_linux- | 6.0 MB | | 0%  2025-05-07T20:24:49.2310851Z 2025-05-07T20:24:49.2310857Z 2025-05-07T20:24:49.2310862Z 2025-05-07T20:24:49.2317219Z 2025-05-07T20:24:49.2330606Z libstdcxx-15.1.0 | 3.7 MB | | 0%  2025-05-07T20:24:49.2331002Z 2025-05-07T20:24:49.2331021Z 2025-05-07T20:24:49.2331026Z 2025-05-07T20:24:49.2331031Z 2025-05-07T20:24:49.2331044Z 2025-05-07T20:24:49.2331871Z libsanitizer-11.4.0 | 3.5 MB | | 0%  2025-05-07T20:24:49.2332278Z 2025-05-07T20:24:49.2332283Z 2025-05-07T20:24:49.2332289Z 2025-05-07T20:24:49.2332305Z 2025-05-07T20:24:49.2332311Z 2025-05-07T20:24:49.2332323Z 2025-05-07T20:24:49.2333462Z libgcc-devel_linux-6 | 2.3 MB | | 0%  2025-05-07T20:24:49.2333865Z 2025-05-07T20:24:49.2333871Z 2025-05-07T20:24:49.2333888Z 2025-05-07T20:24:49.2333900Z 2025-05-07T20:24:49.2333918Z 2025-05-07T20:24:49.2333923Z 2025-05-07T20:24:49.2333929Z 2025-05-07T20:24:49.2334945Z ld_impl_linux-64-2.4 | 691 KB | | 0%  2025-05-07T20:24:49.2335340Z 2025-05-07T20:24:49.2335356Z 2025-05-07T20:24:49.2335362Z 2025-05-07T20:24:49.2335376Z 2025-05-07T20:24:49.2335381Z 2025-05-07T20:24:49.2335386Z 2025-05-07T20:24:49.2335392Z 2025-05-07T20:24:49.2335397Z 2025-05-07T20:24:49.2336433Z libstdcxx-ng-15.1.0 | 34 KB | | 0%  2025-05-07T20:24:49.2336849Z 2025-05-07T20:24:49.2336866Z 2025-05-07T20:24:49.2336872Z 2025-05-07T20:24:49.2336877Z 2025-05-07T20:24:49.2336882Z 2025-05-07T20:24:49.2336887Z 2025-05-07T20:24:49.2336892Z 2025-05-07T20:24:49.2336897Z 2025-05-07T20:24:49.2336902Z 2025-05-07T20:24:49.2346870Z gcc_linux-64-11.4.0 | 31 KB | | 0%  2025-05-07T20:24:49.2347293Z 2025-05-07T20:24:49.2347298Z 2025-05-07T20:24:49.2347303Z 2025-05-07T20:24:49.2347309Z 2025-05-07T20:24:49.2347323Z 2025-05-07T20:24:49.2347329Z 2025-05-07T20:24:49.2347334Z 2025-05-07T20:24:49.2347339Z 2025-05-07T20:24:49.2347344Z 2025-05-07T20:24:49.2355324Z 2025-05-07T20:24:49.2356503Z gxx_linux-64-11.4.0 | 29 KB | | 0%  2025-05-07T20:24:49.2356919Z 2025-05-07T20:24:49.2356924Z 2025-05-07T20:24:49.2356929Z 2025-05-07T20:24:49.2356934Z 2025-05-07T20:24:49.2356940Z 2025-05-07T20:24:49.2356945Z 2025-05-07T20:24:49.2356950Z 2025-05-07T20:24:49.2356955Z 2025-05-07T20:24:49.2356967Z 2025-05-07T20:24:49.2356972Z 2025-05-07T20:24:49.2356977Z 2025-05-07T20:24:49.3279291Z binutils_linux-64-2. 
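Once the transaction below finishes, the setup script exposes the toolchain's prefixed binaries (x86_64-conda-linux-gnu-cc and -c++) under the generic names cc, gcc, c++, and g++, then confirms each resolves in PATH. A hedged sketch of that symlink-and-check pattern, with the prefix path taken from this log:

    PREFIX=/home/ec2-user/miniconda/envs/build_binary
    ln -sf "$PREFIX/bin/x86_64-conda-linux-gnu-cc"  "$PREFIX/bin/cc"
    ln -sf "$PREFIX/bin/x86_64-conda-linux-gnu-cc"  "$PREFIX/bin/gcc"
    ln -sf "$PREFIX/bin/x86_64-conda-linux-gnu-c++" "$PREFIX/bin/c++"
    ln -sf "$PREFIX/bin/x86_64-conda-linux-gnu-c++" "$PREFIX/bin/g++"
    # Each name should now resolve to a path under $PREFIX/bin
    command -v cc gcc c++ g++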
2025-05-07T20:24:49.2291617Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:51.4919965Z Preparing transaction: done
2025-05-07T20:24:51.7935910Z Verifying transaction: done
2025-05-07T20:24:51.8950949Z Executing transaction: done
2025-05-07T20:24:52.0610101Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:55.9395758Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:55.9428417Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:55.9458228Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:55.9489409Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:57.8349368Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:57.8979704Z [CHECK] Binary cc found in PATH
2025-05-07T20:24:59.7780568Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:59.8405093Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:01.7245961Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:01.7885838Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:03.6698603Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:03.7321627Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:03.7326290Z [INFO] Printing out all preprocessor defines in the C compiler ...
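The dump that follows is produced by asking the preprocessor for its predefined macros: -dM prints every #define in effect instead of the preprocessed source, -E stops after preprocessing, and the trailing - reads the (empty) translation unit from stdin. A sketch for spot-checking a single macro, assuming an empty stdin is supplied:

    conda run -n build_binary cc -dM -E - </dev/null | grep __VERSION__
    # expected from this toolchain: #define __VERSION__ "11.4.0"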
2025-05-07T20:25:03.7326857Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:03.7327099Z 2025-05-07T20:25:05.6140840Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:05.6141311Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:05.6141702Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:05.6142053Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:05.6142463Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:05.6142842Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:05.6143137Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:05.6143495Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:05.6143873Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:05.6144226Z #define __CHAR_BIT__ 8 2025-05-07T20:25:05.6144538Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:05.6145220Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:05.6145490Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:05.6145770Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:05.6146051Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:05.6146357Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6146659Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:05.6146957Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:05.6147292Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:05.6147633Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:05.6148054Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:05.6148494Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:05.6148979Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:05.6149268Z #define __GCC_IEC_559 2 2025-05-07T20:25:05.6149525Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:05.6149939Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:05.6150215Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:05.6150501Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:05.6150841Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6151170Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:05.6151448Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.6151728Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:05.6152000Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:05.6152263Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:05.6152529Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:05.6152797Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:05.6153055Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:05.6153310Z #define __INT8_C(c) c 2025-05-07T20:25:05.6153556Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:05.6153854Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6154189Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:05.6154517Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:05.6154885Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:05.6155168Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:05.6155436Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6155715Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:05.6155996Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:05.6156406Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:05.6156845Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:05.6157137Z #define __linux 1 2025-05-07T20:25:05.6157365Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:05.6157650Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:05.6157930Z #define __unix 1 2025-05-07T20:25:05.6158163Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:05.6158446Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:05.6158720Z #define __WINT_MIN__ 0U 2025-05-07T20:25:05.6158966Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.6159261Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:05.6159532Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:05.6159808Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:05.6160061Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:05.6160349Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:05.6160655Z #define __INT64_C(c) c ## L 2025-05-07T20:25:05.6160921Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:05.6161227Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:05.6161487Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:05.6161851Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:05.6162247Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:05.6162498Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:05.6162771Z #define __DBL_DIG__ 15 2025-05-07T20:25:05.6163002Z #define __FLT32_DIG__ 6 2025-05-07T20:25:05.6163307Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:05.6163675Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:05.6164026Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:05.6164356Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:05.6164724Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:05.6164982Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:05.6165245Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:05.6165635Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:05.6166064Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:05.6166354Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:05.6166614Z #define __unix__ 1 2025-05-07T20:25:05.6166843Z #define __INT_WIDTH__ 32 2025-05-07T20:25:05.6167096Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:05.6167346Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:05.6167684Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:05.6167952Z #define __UINT16_C(c) c 2025-05-07T20:25:05.6168181Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:05.6168436Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:05.6168809Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:05.6169183Z #define __gnu_linux__ 1 2025-05-07T20:25:05.6169427Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:05.6169707Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.6169999Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6170263Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:05.6170528Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:05.6170780Z #define __GNUC__ 11 2025-05-07T20:25:05.6170988Z #define __pie__ 2 2025-05-07T20:25:05.6171201Z #define __MMX__ 1 2025-05-07T20:25:05.6171420Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:05.6171683Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:05.6171964Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:05.6172248Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:05.6172600Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:05.6173018Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6173350Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:05.6173612Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:05.6173875Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:05.6174182Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:05.6174449Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:05.6174712Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:05.6174996Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:05.6175297Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:05.6175599Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:05.6175912Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:05.6176167Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:05.6176439Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:05.6176708Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:05.6185567Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:05.6185839Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:05.6186176Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:05.6186565Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:05.6186847Z #define __SSE2_MATH__ 1 2025-05-07T20:25:05.6187091Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:05.6187399Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6187706Z #define __amd64 1 2025-05-07T20:25:05.6187930Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:05.6188202Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:05.6188514Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:05.6188832Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:05.6189096Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:05.6189379Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:05.6189631Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:05.6190034Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:05.6190307Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:05.6190564Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:05.6190834Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:05.6191121Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:05.6191545Z #define __x86_64 1 2025-05-07T20:25:05.6191764Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:05.6192138Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:05.6192613Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:05.6193081Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:05.6193580Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:05.6193984Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:05.6194235Z #define __LP64__ 1 2025-05-07T20:25:05.6194465Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6194827Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:05.6195344Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:05.6195617Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:05.6195896Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.6196190Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:05.6196476Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:05.6196749Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:05.6197011Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:05.6197265Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:05.6197529Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:05.6197868Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:05.6198233Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:05.6198517Z #define __FLT_DIG__ 6 2025-05-07T20:25:05.6198747Z #define __NO_INLINE__ 1 2025-05-07T20:25:05.6198979Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:05.6199314Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:05.6199675Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:05.6199939Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:05.6200197Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:05.6200453Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:05.6200712Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:05.6200967Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:05.6201269Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:05.6201558Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:05.6201821Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:05.6202128Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:05.6202471Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:05.6202732Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:05.6202996Z #define __FLT128_DIG__ 33 2025-05-07T20:25:05.6203241Z #define __INT32_C(c) c 2025-05-07T20:25:05.6203477Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:05.6203766Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:05.6204049Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:05.6204341Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:05.6204661Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:05.6204979Z #define unix 1 2025-05-07T20:25:05.6205207Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:05.6205526Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6205838Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:05.6206157Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:05.6206487Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:05.6206739Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:05.6207005Z #define __ELF__ 1 2025-05-07T20:25:05.6207227Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:05.6207514Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:05.6207795Z #define __FLT_RADIX__ 2 2025-05-07T20:25:05.6208037Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:05.6208410Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:05.6208789Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:05.6209053Z #define __SSE_MATH__ 1 2025-05-07T20:25:05.6209271Z #define __k8 1 2025-05-07T20:25:05.6209572Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:05.6209960Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:05.6210345Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:05.6210655Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:05.6210920Z #define __LDBL_DIG__ 18 2025-05-07T20:25:05.6211152Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:05.6211409Z #define __x86_64__ 1 2025-05-07T20:25:05.6211647Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:05.6211944Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:05.6212284Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6212595Z #define __FLT64_DIG__ 15 2025-05-07T20:25:05.6212875Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6213227Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:05.6213546Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6213923Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:05.6214194Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6214495Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:05.6214879Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:05.6215283Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:05.6215576Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:05.6215918Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:05.6216278Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:05.6216588Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:05.6216866Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:05.6217179Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:05.6217461Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:05.6217696Z #define __SEG_FS 1 2025-05-07T20:25:05.6217929Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:05.6218210Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:05.6218490Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6218782Z #define __SEG_GS 1 2025-05-07T20:25:05.6219103Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:05.6219506Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:05.6219774Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:05.6220070Z #define __INT16_TYPE__ short int 2025-05-07T20:25:05.6220351Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:05.6220647Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:05.6220912Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:05.6221158Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:05.6221414Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:05.6221764Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:05.6222166Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6222459Z #define linux 1 2025-05-07T20:25:05.6222691Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6222980Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:05.6223265Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:05.6223515Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:05.6223784Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:05.6224055Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:05.6224419Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:05.6224855Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:05.6225199Z #define __code_model_small__ 1 2025-05-07T20:25:05.6225473Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:05.6225771Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:05.6226025Z #define __k8__ 1 2025-05-07T20:25:05.6226251Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:05.6226545Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:05.6226856Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:05.6227098Z #define __pic__ 2 2025-05-07T20:25:05.6227351Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6227673Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:05.6227984Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6228320Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:05.6228708Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:05.6229181Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:05.6229449Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:05.6229807Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:05.6230128Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:05.6230376Z #define __linux__ 1 2025-05-07T20:25:05.6230607Z #define __INT64_TYPE__ long int 2025-05-07T20:25:05.6230872Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:05.6231125Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:05.6231404Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:05.6231666Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:05.6231962Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6232291Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:05.6232677Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:05.6232951Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:05.6233241Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:05.6233543Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:05.6233892Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:05.6234258Z #define __SSE__ 1 2025-05-07T20:25:05.6234482Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:05.6234826Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:05.6235177Z #define __amd64__ 1 2025-05-07T20:25:05.6235405Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:05.6235658Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:05.6235922Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:05.6236200Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:05.6236470Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:05.6236745Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:05.6237001Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:05.6237282Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:05.6237555Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:05.6237905Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:05.6238401Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:05.6238774Z #define _LP64 1 2025-05-07T20:25:05.6238981Z #define __UINT8_C(c) c 2025-05-07T20:25:05.6239219Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:05.6239485Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:05.6239750Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:05.6240022Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:05.6240323Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:05.6240690Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:05.6241168Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:05.6241551Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6241853Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6242161Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:05.6242538Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:05.6242920Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:05.6243188Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:05.6243525Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:05.6243903Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:05.6244162Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:05.6244405Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:05.6244661Z #define __FXSR__ 1 2025-05-07T20:25:05.6244965Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:05.6245436Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:05.6245858Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:05.6246169Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:05.6246419Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:05.6246759Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:05.6247127Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:05.6247370Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:05.6247692Z #define __PIC__ 2 2025-05-07T20:25:05.6247943Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:05.6248356Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:05.6248748Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:05.6249088Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:05.6249427Z #define __SSE2__ 1 2025-05-07T20:25:05.6249640Z #define __INT32_TYPE__ int 2025-05-07T20:25:05.6249897Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:05.6250161Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:05.6250495Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:05.6250866Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:05.6251219Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:05.6251495Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:05.6251761Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6252040Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:05.6252297Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:05.6252540Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:05.6252830Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6253131Z #define __PIE__ 2 2025-05-07T20:25:05.6253453Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:05.6253863Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:05.6254217Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:05.6254589Z #define __INT16_C(c) c 2025-05-07T20:25:05.6254823Z #define __STDC__ 1 2025-05-07T20:25:05.6255062Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:05.6255332Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:05.6255594Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.6255908Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:05.6256266Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:05.6256607Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:05.6256884Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.6257174Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:05.6257439Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:05.6257728Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:05.6258030Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6258307Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:05.6258611Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6259023Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:05.6259406Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:05.6259716Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:05.6260021Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:05.6260273Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:05.6260439Z 2025-05-07T20:25:05.6781252Z 2025-05-07T20:25:05.6781605Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
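The second dump repeats the exercise for C++: -x c++ forces the driver to preprocess stdin as C++, so the language-mode macros (e.g. __cplusplus) and the C++ feature-test macros (e.g. __cpp_if_constexpr) appear alongside the target defines. A sketch for checking which standard this toolchain compiles under by default:

    conda run -n build_binary c++ -dM -E -x c++ - </dev/null | grep -w __cplusplus
    # expected from the dump below: #define __cplusplus 201703L  (C++17)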
2025-05-07T20:25:05.6782253Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:05.6782585Z 2025-05-07T20:25:07.5592642Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:07.5593141Z #define __cpp_attributes 200809L 2025-05-07T20:25:07.5593598Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:07.5594019Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:07.5594314Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:07.5594584Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:07.5594929Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:07.5595288Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:07.5595575Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:07.5595944Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:07.5596417Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:07.5596809Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:07.5597152Z #define __CHAR_BIT__ 8 2025-05-07T20:25:07.5597458Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:07.5597712Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:07.5597966Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:07.5600066Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:07.5600367Z #define __cpp_static_assert 201411L 2025-05-07T20:25:07.5600656Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:07.5600965Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.5601274Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:07.5601568Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:07.5601894Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:07.5602226Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:07.5602645Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:07.5603068Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:07.5603547Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:07.5603837Z #define __GCC_IEC_559 2 2025-05-07T20:25:07.5604080Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:07.5604359Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:07.5604647Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:07.5604937Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:07.5605241Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:07.5605568Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:07.5605885Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:07.5606216Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.5606546Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:07.5606820Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.5607096Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:07.5607375Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:07.5607677Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:07.5607940Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:07.5608212Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:07.5608489Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:07.5608821Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:07.5609155Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:07.5609420Z #define __INT8_C(c) c 2025-05-07T20:25:07.5609663Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:07.5609929Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:07.5610255Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.5610587Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:07.5610860Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:07.5611163Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:07.5611493Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.5611854Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:07.5612143Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:07.5612427Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.5612688Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.5612975Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:07.5613254Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:07.5613655Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:07.5614092Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:07.5614386Z #define __linux 1 2025-05-07T20:25:07.5614617Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:07.5614892Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:07.5615179Z #define __unix 1 2025-05-07T20:25:07.5615407Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:07.5615689Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:07.5615984Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:07.5616265Z #define __WINT_MIN__ 0U 2025-05-07T20:25:07.5616506Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.5616793Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:07.5617073Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:07.5617346Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:07.5617602Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:07.5617891Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:07.5618185Z #define __INT64_C(c) c ## L 2025-05-07T20:25:07.5618558Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:07.5618874Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:07.5619159Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:07.5619471Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:07.5619762Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:07.5620034Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:07.5620399Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:07.5620806Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:07.5621067Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:07.5621346Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:07.5621631Z #define __DBL_DIG__ 15 2025-05-07T20:25:07.5621868Z #define __FLT32_DIG__ 6 2025-05-07T20:25:07.5622256Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:07.5622616Z #define __GXX_WEAK__ 1 2025-05-07T20:25:07.5622851Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:07.5623105Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:07.5623441Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:07.5623806Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.5624075Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:07.5624376Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:07.5624713Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:07.5625129Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:07.5625548Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:07.5625839Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:07.5626107Z #define __unix__ 1 2025-05-07T20:25:07.5626339Z #define __INT_WIDTH__ 32 2025-05-07T20:25:07.5626591Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:07.5626844Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:07.5627109Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:07.5627385Z #define __UINT16_C(c) c 2025-05-07T20:25:07.5627621Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:07.5627906Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:07.5628287Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:07.5628667Z #define __gnu_linux__ 1 2025-05-07T20:25:07.5628915Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:07.5629187Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:07.5629473Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.5629982Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.5630292Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:07.5630557Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:07.5630825Z #define __GNUC__ 11 2025-05-07T20:25:07.5640541Z #define __GXX_RTTI 1 2025-05-07T20:25:07.5640829Z #define __pie__ 2 2025-05-07T20:25:07.5641047Z #define __MMX__ 1 2025-05-07T20:25:07.5641280Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:07.5641578Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:07.5641869Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:07.5642153Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:07.5642419Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:07.5642739Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:07.5643071Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:07.5643443Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.5643845Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:07.5644161Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.5644498Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.5644777Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:07.5645053Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:07.5645380Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:07.5645693Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:07.5645965Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:07.5646248Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:07.5646548Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:07.5646862Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:07.5647142Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:07.5647630Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:07.5647895Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:07.5648158Z #define __cplusplus 201703L 2025-05-07T20:25:07.5648433Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:07.5648728Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:07.5648989Z #define __DEPRECATED 1 2025-05-07T20:25:07.5649258Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:07.5649567Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:07.5649826Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:07.5650163Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.5650540Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:07.5650817Z #define __SSE2_MATH__ 1 2025-05-07T20:25:07.5651082Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:07.5651488Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.5651787Z #define __amd64 1 2025-05-07T20:25:07.5652009Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:07.5652282Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:07.5652562Z #define __GNUG__ 11 2025-05-07T20:25:07.5652818Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:07.5653137Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:07.5653396Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:07.5653656Z #define __FLT64X_MIN_EXP__ (-16381) 
[... full compiler predefined-macro dump omitted: GCC 11.4.0 (__VERSION__ "11.4.0", conda-forge), x86_64 Linux/ELF, little-endian, LP64, with C++17-level feature-test macros (__cpp_deduction_guides 201703L, __cpp_structured_bindings 201606L, ...) ...]
2025-05-07T20:25:07.6245925Z + conda run -n build_binary c++ --version
2025-05-07T20:25:09.5189446Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:09.5190202Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:09.5190906Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:09.5191711Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:09.5821884Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:09.5822490Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:11.5343311Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:11.5347169Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:11.5347755Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:13.4883019Z #define __cplusplus 201703L
2025-05-07T20:25:13.4886070Z [INSTALL] Successfully installed C/C++ compilers
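[NOTE] The two probes above work by preprocessing an empty translation unit and grepping the compiler's predefined macros; 201710L and 201703L correspond to C17 and C++17 respectively. A minimal standalone sketch of the same check (assumes only a conda env named build_binary with the compilers installed, as in this job):

    # Default C standard reported by the C compiler (201710L == C17)
    conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
    # Default C++ standard; -x c++ forces C++ mode for stdin (201703L == C++17)
    conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
    # Same probe with an explicit -std flag, to confirm an override takes effect
    conda run -n build_binary c++ -std=c++20 -dM -E -x c++ - < /dev/null | grep __cplusplus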
2025-05-07T20:25:13.4931226Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:13.4931647Z . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:13.4944617Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:13.4944974Z env:
2025-05-07T20:25:13.4945198Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:13.4945495Z   BUILD_ENV: build_binary
2025-05-07T20:25:13.4945740Z   BUILD_TARGET: genai
2025-05-07T20:25:13.4945970Z   BUILD_VARIANT: cuda
2025-05-07T20:25:13.4946199Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:13.4946455Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:13.4946758Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:13.4947090Z ##[endgroup]
2025-05-07T20:25:13.8298402Z ################################################################################
2025-05-07T20:25:13.8298775Z # Install CUDA
2025-05-07T20:25:13.8298982Z #
2025-05-07T20:25:13.8315157Z # [2025-05-07T20:25:13.831Z] + install_cuda build_binary 12.8.0
2025-05-07T20:25:13.8315879Z ################################################################################
2025-05-07T20:25:13.8330920Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:13.9240524Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:13.9240879Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:13.9245601Z + conda clean --packages --tarball -y
2025-05-07T20:25:14.6336363Z Will remove 32 (140.4 MB) tarball(s).
2025-05-07T20:25:14.6336707Z Will remove 6 (617 KB) package(s).
2025-05-07T20:25:14.7142138Z + conda clean --all -y
2025-05-07T20:25:15.3825238Z There are no unused tarball(s) to remove.
2025-05-07T20:25:15.3825919Z Will remove 1 index cache(s).
2025-05-07T20:25:15.3826489Z There are no unused package(s) to remove.
2025-05-07T20:25:15.3827115Z There are no tempfile(s) to remove.
2025-05-07T20:25:15.3827725Z There are no logfile(s) to remove.
2025-05-07T20:25:15.4478141Z [INSTALL] Installing CUDA 12.8.0 ...
2025-05-07T20:25:15.4502292Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.8.0
2025-05-07T20:25:16.3554719Z Channels:
2025-05-07T20:25:16.3555301Z  - conda-forge
2025-05-07T20:25:26.8573259Z Platform: linux-64
2025-05-07T20:25:26.8574912Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:27.9781591Z Solving environment: done
2025-05-07T20:25:28.0516806Z ## Package Plan ##
2025-05-07T20:25:28.0517339Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:28.0517779Z   added / updated specs:
2025-05-07T20:25:28.0518047Z     - cuda=12.8.0
2025-05-07T20:25:28.0518366Z The following packages will be downloaded:
2025-05-07T20:25:28.0518812Z     package                    |            build
2025-05-07T20:25:28.0519277Z     ---------------------------|-----------------
2025-05-07T20:25:28.0519746Z     alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge
2025-05-07T20:25:28.0520410Z     attr-2.5.1 | h166bdaf_1 69 KB conda-forge
2025-05-07T20:25:28.0520987Z     binutils-2.40 | h4852527_7 31 KB conda-forge
2025-05-07T20:25:28.0521507Z     bzip2-1.0.8 | h4bc722e_7 247 KB conda-forge
2025-05-07T20:25:28.0521942Z     c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge
2025-05-07T20:25:28.0522395Z     cuda-12.8.0 | ha804496_0 26 KB conda-forge
2025-05-07T20:25:28.0522969Z     cuda-cccl_linux-64-12.8.55 | ha770c72_1 1.0 MB conda-forge
2025-05-07T20:25:28.0524189Z     cuda-command-line-tools-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:28.0524741Z     cuda-compiler-12.8.0 | hbad6d8a_0 20 KB conda-forge
2025-05-07T20:25:28.0525249Z     cuda-crt-dev_linux-64-12.8.61 | ha770c72_1 90 KB conda-forge
2025-05-07T20:25:28.0525752Z     cuda-crt-tools-12.8.61 | ha770c72_1 27 KB conda-forge
2025-05-07T20:25:28.0526223Z     cuda-cudart-12.8.57 | h5888daf_1 22 KB conda-forge
2025-05-07T20:25:28.0526720Z     cuda-cudart-dev-12.8.57 | h5888daf_1 23 KB conda-forge
2025-05-07T20:25:28.0527246Z     cuda-cudart-dev_linux-64-12.8.57 | h3f2d84a_1 377 KB conda-forge
2025-05-07T20:25:28.0527784Z     cuda-cudart-static-12.8.57 | h5888daf_1 22 KB conda-forge
2025-05-07T20:25:28.0528335Z     cuda-cudart-static_linux-64-12.8.57 | h3f2d84a_1 950 KB conda-forge
2025-05-07T20:25:28.0529059Z     cuda-cudart_linux-64-12.8.57 | h3f2d84a_1 188 KB conda-forge
2025-05-07T20:25:28.0529573Z     cuda-cuobjdump-12.8.55 | hbd13f7d_0 227 KB conda-forge
2025-05-07T20:25:28.0530050Z     cuda-cupti-12.8.57 | hbd13f7d_0 1.8 MB conda-forge
2025-05-07T20:25:28.0530522Z     cuda-cupti-dev-12.8.57 | h5888daf_0 4.0 MB conda-forge
2025-05-07T20:25:28.0531019Z     cuda-cuxxfilt-12.8.55 | hbd13f7d_0 211 KB conda-forge
2025-05-07T20:25:28.0531559Z     cuda-driver-dev-12.8.57 | h5888daf_1 22 KB conda-forge
2025-05-07T20:25:28.0532082Z     cuda-driver-dev_linux-64-12.8.90 | h3f2d84a_1 36 KB conda-forge
2025-05-07T20:25:28.0532584Z     cuda-gdb-12.8.55 | h50b4baa_0 353 KB conda-forge
2025-05-07T20:25:28.0533060Z     cuda-libraries-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:28.0533574Z     cuda-libraries-dev-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:28.0534080Z     cuda-nsight-12.8.55 | h7938cbb_0 113.2 MB conda-forge
2025-05-07T20:25:28.0534547Z     cuda-nvcc-12.8.61 | hcdd1206_0 23 KB conda-forge
2025-05-07T20:25:28.0535043Z     cuda-nvcc-dev_linux-64-12.8.61 | he91c749_1 12.7 MB conda-forge
2025-05-07T20:25:28.0535550Z     cuda-nvcc-impl-12.8.61 | h85509e4_1 25 KB conda-forge
2025-05-07T20:25:28.0536047Z     cuda-nvcc-tools-12.8.61 | he02047a_1 24.5 MB conda-forge
2025-05-07T20:25:28.0536551Z     cuda-nvcc_linux-64-12.8.61 | h04802cd_0 25 KB conda-forge
2025-05-07T20:25:28.0537043Z     cuda-nvdisasm-12.8.55 | hbd13f7d_0 4.9 MB conda-forge
2025-05-07T20:25:28.0537529Z     cuda-nvml-dev-12.8.55 | hbd13f7d_0 134 KB conda-forge
2025-05-07T20:25:28.0538008Z     cuda-nvprof-12.8.57 | hbd13f7d_0 2.5 MB conda-forge
2025-05-07T20:25:28.0538504Z     cuda-nvprune-12.8.55 | hbd13f7d_0 68 KB conda-forge
2025-05-07T20:25:28.0538978Z     cuda-nvrtc-12.8.61 | hbd13f7d_0 63.1 MB conda-forge
2025-05-07T20:25:28.0539455Z     cuda-nvrtc-dev-12.8.61 | h5888daf_0 34 KB conda-forge
2025-05-07T20:25:28.0539926Z     cuda-nvtx-12.8.55 | hbd13f7d_0 31 KB conda-forge
2025-05-07T20:25:28.0540420Z     cuda-nvvm-dev_linux-64-12.8.61 | ha770c72_1 25 KB conda-forge
2025-05-07T20:25:28.0540925Z     cuda-nvvm-impl-12.8.61 | he02047a_1 20.8 MB conda-forge
2025-05-07T20:25:28.0541422Z     cuda-nvvm-tools-12.8.61 | he02047a_1 23.5 MB conda-forge
2025-05-07T20:25:28.0541904Z     cuda-nvvp-12.8.57 | hbd13f7d_0 112.4 MB conda-forge
2025-05-07T20:25:28.0542365Z     cuda-opencl-12.8.55 | hbd13f7d_0 29 KB conda-forge
2025-05-07T20:25:28.0542858Z     cuda-opencl-dev-12.8.55 | h5888daf_0 95 KB conda-forge
2025-05-07T20:25:28.0543502Z     cuda-profiler-api-12.8.55 | h7938cbb_0 22 KB conda-forge
2025-05-07T20:25:28.0544003Z     cuda-runtime-12.8.0 | ha804496_0 20 KB conda-forge
2025-05-07T20:25:28.0544501Z     cuda-sanitizer-api-12.8.55 | hbd13f7d_0 8.8 MB conda-forge
2025-05-07T20:25:28.0545002Z     cuda-toolkit-12.8.0 | ha804496_0 20 KB conda-forge
2025-05-07T20:25:28.0545465Z     cuda-tools-12.8.0 | ha770c72_0 19 KB conda-forge
2025-05-07T20:25:28.0545920Z     cuda-version-12.8 | h5d125a7_3 21 KB conda-forge
2025-05-07T20:25:28.0546416Z     cuda-visual-tools-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:28.0546917Z     cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge
2025-05-07T20:25:28.0547361Z     dbus-1.13.6 | h5008d03_3 604 KB conda-forge
2025-05-07T20:25:28.0547774Z     expat-2.7.0 | h5888daf_0 137 KB conda-forge
2025-05-07T20:25:28.0548355Z     font-ttf-dejavu-sans-mono-2.37 | hab24e00_0 388 KB conda-forge
2025-05-07T20:25:28.0548911Z     font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge
2025-05-07T20:25:28.0549455Z     font-ttf-source-code-pro-2.038 | h77eed37_0 684 KB conda-forge
2025-05-07T20:25:28.0550139Z     font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge
2025-05-07T20:25:28.0550627Z     fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge
2025-05-07T20:25:28.0551132Z     fonts-conda-ecosystem-1 | 0 4 KB conda-forge
2025-05-07T20:25:28.0551641Z     fonts-conda-forge-1 | 0 4 KB conda-forge
2025-05-07T20:25:28.0552119Z     freetype-2.13.3 | ha770c72_1 168 KB conda-forge
2025-05-07T20:25:28.0552550Z     gcc-11.4.0 | h602e360_13 49 KB conda-forge
2025-05-07T20:25:28.0552997Z     gds-tools-1.13.0.11 | h5888daf_0 37.9 MB conda-forge
2025-05-07T20:25:28.0553423Z     gmp-6.3.0 | hac33072_2 449 KB conda-forge
2025-05-07T20:25:28.0553827Z     gxx-11.4.0 | h602e360_13 49 KB conda-forge
2025-05-07T20:25:28.0554259Z     keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge
2025-05-07T20:25:28.0554685Z     krb5-1.21.3 | h659f571_0 1.3 MB conda-forge
2025-05-07T20:25:28.0555107Z     libcap-2.71 | h39aace5_0 100 KB conda-forge
2025-05-07T20:25:28.0555563Z     libcublas-12.8.3.14 | h9ab20c4_0 460.2 MB conda-forge
2025-05-07T20:25:28.0556053Z     libcublas-dev-12.8.3.14 | h9ab20c4_0 89 KB conda-forge
2025-05-07T20:25:28.0556530Z     libcufft-11.3.3.41 | hbd13f7d_0 147.4 MB conda-forge
2025-05-07T20:25:28.0557010Z     libcufft-dev-11.3.3.41 | h5888daf_0 33 KB conda-forge
2025-05-07T20:25:28.0557500Z     libcufile-1.13.0.11 | h12f29b5_0 939 KB conda-forge
2025-05-07T20:25:28.0557978Z     libcufile-dev-1.13.0.11 | h5888daf_0 35 KB conda-forge
2025-05-07T20:25:28.0558458Z     libcurand-10.3.9.55 | hbd13f7d_0 43.6 MB conda-forge
2025-05-07T20:25:28.0558935Z     libcurand-dev-10.3.9.55 | h5888daf_0 265 KB conda-forge
2025-05-07T20:25:28.0559450Z     libcusolver-11.7.2.55 | h9ab20c4_0 156.9 MB conda-forge
2025-05-07T20:25:28.0559956Z     libcusolver-dev-11.7.2.55 | h9ab20c4_0 59 KB conda-forge
2025-05-07T20:25:28.0560464Z     libcusparse-12.5.7.53 | hbd13f7d_0 164.9 MB conda-forge
2025-05-07T20:25:28.0560962Z     libcusparse-dev-12.5.7.53 | h5888daf_0 51 KB conda-forge
2025-05-07T20:25:28.0561470Z     libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge
2025-05-07T20:25:28.0562048Z     libexpat-2.7.0 | h5888daf_0 73 KB conda-forge
2025-05-07T20:25:28.0562506Z     libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge
2025-05-07T20:25:28.0562997Z     libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge
2025-05-07T20:25:28.0563489Z     libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge
2025-05-07T20:25:28.0563960Z     libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge
2025-05-07T20:25:28.0564401Z     libglvnd-1.7.0 | ha4b6fd6_2 129 KB conda-forge
2025-05-07T20:25:28.0564868Z     libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge
2025-05-07T20:25:28.0565334Z     libiconv-1.18 | h4ce23a2_1 696 KB conda-forge
2025-05-07T20:25:28.0565765Z     libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge
2025-05-07T20:25:28.0566211Z     libnpp-12.3.3.65 | hbd13f7d_0 130.6 MB conda-forge
2025-05-07T20:25:28.0566755Z     libnpp-dev-12.3.3.65 | h5888daf_0 443 KB conda-forge
2025-05-07T20:25:28.0567210Z     libnsl-2.0.1 | hd590300_0 33 KB conda-forge
2025-05-07T20:25:28.0567641Z     libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge
2025-05-07T20:25:28.0568105Z     libnvfatbin-12.8.55 | hbd13f7d_0 793 KB conda-forge
2025-05-07T20:25:28.0568598Z     libnvfatbin-dev-12.8.55 | h5888daf_0 26 KB conda-forge
2025-05-07T20:25:28.0569094Z     libnvjitlink-12.8.61 | hbd13f7d_0 28.7 MB conda-forge
2025-05-07T20:25:28.0569600Z     libnvjitlink-dev-12.8.61 | h5888daf_0 25 KB conda-forge
2025-05-07T20:25:28.0570095Z     libnvjpeg-12.3.5.57 | h97fd463_0 3.0 MB conda-forge
2025-05-07T20:25:28.0570577Z     libnvjpeg-dev-12.3.5.57 | ha770c72_0 31 KB conda-forge
2025-05-07T20:25:28.0571066Z     libopengl-1.7.0 | ha4b6fd6_2 50 KB conda-forge
2025-05-07T20:25:28.0571513Z     libpng-1.6.47 | h943b412_0 282 KB conda-forge
2025-05-07T20:25:28.0571958Z     libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge
2025-05-07T20:25:28.0572427Z     libsystemd0-256.9 | h2774228_0 401 KB conda-forge
2025-05-07T20:25:28.0572898Z     libudev1-257.4 | h9a4d06a_0 140 KB conda-forge
2025-05-07T20:25:28.0573347Z     libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge
2025-05-07T20:25:28.0573783Z     libxcb-1.17.0 | h8a09558_0 387 KB conda-forge
2025-05-07T20:25:28.0574237Z     libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge
2025-05-07T20:25:28.0574717Z     libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge
2025-05-07T20:25:28.0575174Z     libxml2-2.13.5 | h064dc61_0 673 KB conda-forge
2025-05-07T20:25:28.0575620Z     libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge
2025-05-07T20:25:28.0576042Z     lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge
2025-05-07T20:25:28.0576513Z     nsight-compute-2025.1.0.14 | hb5ebaad_0 320.6 MB conda-forge
2025-05-07T20:25:28.0576986Z     nspr-4.36 | h5888daf_0 225 KB conda-forge
2025-05-07T20:25:28.0577396Z     nss-3.111 | h159eef7_0 1.9 MB conda-forge
2025-05-07T20:25:28.0577822Z     ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge
2025-05-07T20:25:28.0578304Z     opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge
2025-05-07T20:25:28.0578775Z     pcre2-10.44 | hc749103_2 934 KB conda-forge
2025-05-07T20:25:28.0579231Z     pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge
2025-05-07T20:25:28.0579817Z     python-3.9.18 | h0755675_1_cpython 22.7 MB conda-forge
2025-05-07T20:25:28.0580281Z     rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge
2025-05-07T20:25:28.0580717Z     sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge
2025-05-07T20:25:28.0581148Z     tk-8.6.13 | noxft_h4845f30_101 3.2 MB conda-forge
2025-05-07T20:25:28.0581580Z     wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge
2025-05-07T20:25:28.0582022Z     xcb-util-0.4.1 | hb711507_2 19 KB conda-forge
2025-05-07T20:25:28.0582488Z     xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge
2025-05-07T20:25:28.0583301Z     xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge
2025-05-07T20:25:28.0583798Z     xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge
2025-05-07T20:25:28.0584319Z     xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge
2025-05-07T20:25:28.0584977Z     xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge
2025-05-07T20:25:28.0585468Z     xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge
2025-05-07T20:25:28.0585959Z     xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge
2025-05-07T20:25:28.0586415Z     xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge
2025-05-07T20:25:28.0586879Z     xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge
2025-05-07T20:25:28.0587351Z     xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge
2025-05-07T20:25:28.0587859Z     xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge
2025-05-07T20:25:28.0588373Z     xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge
2025-05-07T20:25:28.0588871Z     xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge
2025-05-07T20:25:28.0589369Z     xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge
2025-05-07T20:25:28.0589980Z     xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge
2025-05-07T20:25:28.0590458Z     xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge
2025-05-07T20:25:28.0590961Z     xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge
2025-05-07T20:25:28.0591487Z     xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge
2025-05-07T20:25:28.0591971Z     xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge
2025-05-07T20:25:28.0592415Z     zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge
2025-05-07T20:25:28.0592830Z     zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge
2025-05-07T20:25:28.0593232Z     ------------------------------------------------------------
2025-05-07T20:25:28.0593604Z     Total: 1.90 GB
2025-05-07T20:25:28.0593981Z The following NEW packages will be INSTALLED:
[... per-package install listing omitted: identical to the download table above (apart from the updated/superseded entries below), all pulled from conda-forge linux-64 or noarch ...]
2025-05-07T20:25:28.0681305Z The following packages will be UPDATED:
2025-05-07T20:25:28.0681775Z   zlib     pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2
2025-05-07T20:25:28.0682376Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:28.0683362Z   python   pkgs/main::python-3.9.21-he870216_1 --> conda-forge::python-3.9.18-h0755675_1_cpython
2025-05-07T20:25:28.0684046Z   sqlite   pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:25:28.0684667Z   tk       pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
2025-05-07T20:25:28.0685169Z Downloading and Extracting Packages: ...working...
[... interleaved per-package download progress bars omitted (largest: libcublas 460.2 MB, nsight-compute 320.6 MB, libcusparse 164.9 MB, libcusolver 156.9 MB, libcufft 147.4 MB, libnpp 130.6 MB, cuda-nsight 113.2 MB, cuda-nvvp 112.4 MB) ...]
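[NOTE] The [EXEC] [ATTEMPT 0/3] prefix above comes from the setup script's retry helper. A minimal bash sketch of that pattern wrapped around the same pinned install (the exec_with_retries name and the 10-second backoff are illustrative, not the script's actual implementation):

    # Run a command, retrying up to 3 times on failure (sketch of the [EXEC] [ATTEMPT n/3] pattern)
    exec_with_retries () {
      local max=3 i
      for ((i = 0; i < max; i++)); do
        echo "[EXEC] [ATTEMPT ${i}/${max}] + $*"
        "$@" && return 0
        sleep 10  # illustrative backoff between attempts
      done
      echo "[EXEC] Command failed after ${max} attempts: $*" >&2
      return 1
    }

    # Pin the full CUDA 12.8.0 toolkit from conda-forge only: --override-channels ignores
    # any other configured channels, and --force-reinstall keeps the step idempotent.
    exec_with_retries conda install --force-reinstall -n build_binary \
        -c conda-forge --override-channels -y cuda=12.8.0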
| 320.6 MB | ##8 | 29%  2025-05-07T20:25:30.8697756Z 2025-05-07T20:25:30.8700247Z 2025-05-07T20:25:30.8843686Z libcusparse-12.5.7.5 | 164.9 MB | #####8 | 58%  2025-05-07T20:25:30.9052126Z libcublas-12.8.3.14 | 460.2 MB | #9 | 20% 2025-05-07T20:25:30.9052395Z 2025-05-07T20:25:30.9052400Z 2025-05-07T20:25:30.9057624Z 2025-05-07T20:25:30.9106124Z libcusolver-11.7.2.5 | 156.9 MB | ######2 | 63%  2025-05-07T20:25:30.9106513Z 2025-05-07T20:25:30.9106518Z 2025-05-07T20:25:30.9106522Z 2025-05-07T20:25:30.9107151Z 2025-05-07T20:25:30.9300645Z libcufft-11.3.3.41 | 147.4 MB | #####9 | 59%  2025-05-07T20:25:30.9301650Z 2025-05-07T20:25:30.9698474Z nsight-compute-2025. | 320.6 MB | ##9 | 30%  2025-05-07T20:25:30.9698765Z 2025-05-07T20:25:30.9698770Z 2025-05-07T20:25:30.9845447Z libcusparse-12.5.7.5 | 164.9 MB | ###### | 61%  2025-05-07T20:25:31.0052578Z libcublas-12.8.3.14 | 460.2 MB | ## | 21% 2025-05-07T20:25:31.0053269Z 2025-05-07T20:25:31.0053273Z 2025-05-07T20:25:31.0053858Z 2025-05-07T20:25:31.0109820Z libcusolver-11.7.2.5 | 156.9 MB | ######5 | 65%  2025-05-07T20:25:31.0110145Z 2025-05-07T20:25:31.0110151Z 2025-05-07T20:25:31.0110156Z 2025-05-07T20:25:31.0110161Z 2025-05-07T20:25:31.0304395Z libcufft-11.3.3.41 | 147.4 MB | ######1 | 62%  2025-05-07T20:25:31.0304696Z 2025-05-07T20:25:31.0701953Z nsight-compute-2025. | 320.6 MB | ###1 | 31%  2025-05-07T20:25:31.0702248Z 2025-05-07T20:25:31.0702873Z 2025-05-07T20:25:31.0851796Z libcusparse-12.5.7.5 | 164.9 MB | ######2 | 63%  2025-05-07T20:25:31.1054562Z libcublas-12.8.3.14 | 460.2 MB | ##1 | 21% 2025-05-07T20:25:31.1054876Z 2025-05-07T20:25:31.1054881Z 2025-05-07T20:25:31.1056363Z 2025-05-07T20:25:31.1126643Z libcusolver-11.7.2.5 | 156.9 MB | ######7 | 67%  2025-05-07T20:25:31.1127025Z 2025-05-07T20:25:31.1127031Z 2025-05-07T20:25:31.1127037Z 2025-05-07T20:25:31.1127042Z 2025-05-07T20:25:31.1495314Z libcufft-11.3.3.41 | 147.4 MB | ######4 | 64%  2025-05-07T20:25:31.1495684Z 2025-05-07T20:25:31.1727012Z nsight-compute-2025. | 320.6 MB | ###2 | 32%  2025-05-07T20:25:31.1727389Z 2025-05-07T20:25:31.1727394Z 2025-05-07T20:25:31.2007640Z libcusparse-12.5.7.5 | 164.9 MB | ######4 | 65%  2025-05-07T20:25:31.2084717Z libcublas-12.8.3.14 | 460.2 MB | ##2 | 22% 2025-05-07T20:25:31.2084996Z 2025-05-07T20:25:31.2085000Z 2025-05-07T20:25:31.2085004Z 2025-05-07T20:25:31.2248752Z libcusolver-11.7.2.5 | 156.9 MB | ######9 | 70%  2025-05-07T20:25:31.2249085Z 2025-05-07T20:25:31.2249349Z 2025-05-07T20:25:31.2249354Z 2025-05-07T20:25:31.2254543Z 2025-05-07T20:25:31.2702885Z libcufft-11.3.3.41 | 147.4 MB | ######6 | 66%  2025-05-07T20:25:31.2704008Z 2025-05-07T20:25:31.2730751Z nsight-compute-2025. | 320.6 MB | ###3 | 33%  2025-05-07T20:25:31.2731068Z 2025-05-07T20:25:31.2731072Z 2025-05-07T20:25:31.3011232Z libcusparse-12.5.7.5 | 164.9 MB | ######7 | 67%  2025-05-07T20:25:31.3136636Z libcublas-12.8.3.14 | 460.2 MB | ##2 | 23% 2025-05-07T20:25:31.3136930Z 2025-05-07T20:25:31.3136935Z 2025-05-07T20:25:31.3138226Z 2025-05-07T20:25:31.3369490Z libcusolver-11.7.2.5 | 156.9 MB | #######1 | 72%  2025-05-07T20:25:31.3369794Z 2025-05-07T20:25:31.3369799Z 2025-05-07T20:25:31.3369803Z 2025-05-07T20:25:31.3372201Z 2025-05-07T20:25:31.3732268Z libcufft-11.3.3.41 | 147.4 MB | ######8 | 69%  2025-05-07T20:25:31.3732563Z 2025-05-07T20:25:31.3732574Z 2025-05-07T20:25:31.3770241Z libcusparse-12.5.7.5 | 164.9 MB | ######9 | 69%  2025-05-07T20:25:31.3771905Z 2025-05-07T20:25:31.4014417Z nsight-compute-2025. 
| 320.6 MB | ###4 | 34%  2025-05-07T20:25:31.4138289Z libcublas-12.8.3.14 | 460.2 MB | ##3 | 24% 2025-05-07T20:25:31.4138557Z 2025-05-07T20:25:31.4138561Z 2025-05-07T20:25:31.4139314Z 2025-05-07T20:25:31.4733911Z libcusolver-11.7.2.5 | 156.9 MB | #######4 | 74%  2025-05-07T20:25:31.4734218Z 2025-05-07T20:25:31.4734228Z 2025-05-07T20:25:31.4772961Z libcusparse-12.5.7.5 | 164.9 MB | #######1 | 72%  2025-05-07T20:25:31.4775206Z 2025-05-07T20:25:31.5016522Z nsight-compute-2025. | 320.6 MB | ###5 | 36%  2025-05-07T20:25:31.5108520Z libcublas-12.8.3.14 | 460.2 MB | ##4 | 25% 2025-05-07T20:25:31.5108780Z 2025-05-07T20:25:31.5108785Z 2025-05-07T20:25:31.5108788Z 2025-05-07T20:25:31.5108793Z 2025-05-07T20:25:31.5140817Z libcufft-11.3.3.41 | 147.4 MB | ####### | 71%  2025-05-07T20:25:31.5141151Z 2025-05-07T20:25:31.5141155Z 2025-05-07T20:25:31.5141887Z 2025-05-07T20:25:31.5836703Z libcusolver-11.7.2.5 | 156.9 MB | #######6 | 77%  2025-05-07T20:25:31.5837022Z 2025-05-07T20:25:31.5837026Z 2025-05-07T20:25:31.5871258Z libcusparse-12.5.7.5 | 164.9 MB | #######4 | 74%  2025-05-07T20:25:31.5872195Z 2025-05-07T20:25:31.6076509Z nsight-compute-2025. | 320.6 MB | ###6 | 37%  2025-05-07T20:25:31.6114970Z libcublas-12.8.3.14 | 460.2 MB | ##5 | 25% 2025-05-07T20:25:31.6115267Z 2025-05-07T20:25:31.6115271Z 2025-05-07T20:25:31.6115275Z 2025-05-07T20:25:31.6116387Z 2025-05-07T20:25:31.6212007Z libcufft-11.3.3.41 | 147.4 MB | #######2 | 73%  2025-05-07T20:25:31.6212304Z 2025-05-07T20:25:31.6212308Z 2025-05-07T20:25:31.6212312Z 2025-05-07T20:25:31.6902744Z libcusolver-11.7.2.5 | 156.9 MB | #######8 | 79%  2025-05-07T20:25:31.6903048Z 2025-05-07T20:25:31.6903053Z 2025-05-07T20:25:31.6954090Z libcusparse-12.5.7.5 | 164.9 MB | #######6 | 76%  2025-05-07T20:25:31.6954875Z 2025-05-07T20:25:31.7077836Z nsight-compute-2025. | 320.6 MB | ###7 | 38%  2025-05-07T20:25:31.7212421Z libcublas-12.8.3.14 | 460.2 MB | ##6 | 26% 2025-05-07T20:25:31.7212759Z 2025-05-07T20:25:31.7212764Z 2025-05-07T20:25:31.7212768Z 2025-05-07T20:25:31.7694387Z libcusolver-11.7.2.5 | 156.9 MB | ########1 | 81%  2025-05-07T20:25:31.7694715Z 2025-05-07T20:25:31.7694721Z 2025-05-07T20:25:31.7694742Z 2025-05-07T20:25:31.7698471Z 2025-05-07T20:25:31.7904984Z libcufft-11.3.3.41 | 147.4 MB | #######4 | 75%  2025-05-07T20:25:31.7905459Z 2025-05-07T20:25:31.7905465Z 2025-05-07T20:25:31.7955816Z libcusparse-12.5.7.5 | 164.9 MB | #######8 | 79%  2025-05-07T20:25:31.7958764Z 2025-05-07T20:25:31.8079688Z nsight-compute-2025. | 320.6 MB | ###8 | 39%  2025-05-07T20:25:31.8217627Z libcublas-12.8.3.14 | 460.2 MB | ##7 | 27% 2025-05-07T20:25:31.8217937Z 2025-05-07T20:25:31.8218242Z 2025-05-07T20:25:31.8218249Z 2025-05-07T20:25:31.8697572Z libcusolver-11.7.2.5 | 156.9 MB | ########3 | 83%  2025-05-07T20:25:31.8697996Z 2025-05-07T20:25:31.8698003Z 2025-05-07T20:25:31.8698009Z 2025-05-07T20:25:31.8698014Z 2025-05-07T20:25:31.8969609Z libcufft-11.3.3.41 | 147.4 MB | #######6 | 77%  2025-05-07T20:25:31.8969997Z 2025-05-07T20:25:31.8970002Z 2025-05-07T20:25:31.9071456Z libcusparse-12.5.7.5 | 164.9 MB | ########1 | 81%  2025-05-07T20:25:31.9072580Z 2025-05-07T20:25:31.9221142Z nsight-compute-2025. 
| 320.6 MB | ###9 | 40%  2025-05-07T20:25:31.9563545Z libcublas-12.8.3.14 | 460.2 MB | ##7 | 28% 2025-05-07T20:25:31.9563821Z 2025-05-07T20:25:31.9563825Z 2025-05-07T20:25:31.9563829Z 2025-05-07T20:25:31.9697579Z libcusolver-11.7.2.5 | 156.9 MB | ########5 | 86%  2025-05-07T20:25:31.9697881Z 2025-05-07T20:25:31.9697885Z 2025-05-07T20:25:31.9697889Z 2025-05-07T20:25:31.9697893Z 2025-05-07T20:25:32.0024067Z libcufft-11.3.3.41 | 147.4 MB | #######8 | 79%  2025-05-07T20:25:32.0024391Z 2025-05-07T20:25:32.0028718Z 2025-05-07T20:25:32.0186032Z libcusparse-12.5.7.5 | 164.9 MB | ########3 | 83%  2025-05-07T20:25:32.0186404Z 2025-05-07T20:25:32.0261636Z nsight-compute-2025. | 320.6 MB | ####1 | 41%  2025-05-07T20:25:32.0699605Z libcublas-12.8.3.14 | 460.2 MB | ##8 | 29% 2025-05-07T20:25:32.0699930Z 2025-05-07T20:25:32.0699942Z 2025-05-07T20:25:32.0699946Z 2025-05-07T20:25:32.0699950Z 2025-05-07T20:25:32.0711283Z libcufft-11.3.3.41 | 147.4 MB | ########1 | 81%  2025-05-07T20:25:32.0711632Z 2025-05-07T20:25:32.0711637Z 2025-05-07T20:25:32.0712139Z 2025-05-07T20:25:32.1073220Z libcusolver-11.7.2.5 | 156.9 MB | ########7 | 88%  2025-05-07T20:25:32.1073621Z 2025-05-07T20:25:32.1076296Z 2025-05-07T20:25:32.1271430Z libcusparse-12.5.7.5 | 164.9 MB | ########5 | 86%  2025-05-07T20:25:32.1274237Z 2025-05-07T20:25:32.1303696Z nsight-compute-2025. | 320.6 MB | ####2 | 42%  2025-05-07T20:25:32.1704634Z libcublas-12.8.3.14 | 460.2 MB | ##9 | 29% 2025-05-07T20:25:32.1704904Z 2025-05-07T20:25:32.1704909Z 2025-05-07T20:25:32.1704913Z 2025-05-07T20:25:32.1705801Z 2025-05-07T20:25:32.1930997Z libcufft-11.3.3.41 | 147.4 MB | ########3 | 83%  2025-05-07T20:25:32.1931311Z 2025-05-07T20:25:32.1931315Z 2025-05-07T20:25:32.1931319Z 2025-05-07T20:25:32.2078008Z libcusolver-11.7.2.5 | 156.9 MB | ########9 | 90%  2025-05-07T20:25:32.2078466Z 2025-05-07T20:25:32.2079106Z 2025-05-07T20:25:32.2275357Z libcusparse-12.5.7.5 | 164.9 MB | ########7 | 88%  2025-05-07T20:25:32.2275687Z 2025-05-07T20:25:32.2307757Z nsight-compute-2025. | 320.6 MB | ####3 | 43%  2025-05-07T20:25:32.2709147Z libcublas-12.8.3.14 | 460.2 MB | ### | 30% 2025-05-07T20:25:32.2709474Z 2025-05-07T20:25:32.2709478Z 2025-05-07T20:25:32.2709482Z 2025-05-07T20:25:32.2709920Z 2025-05-07T20:25:32.2952750Z libcufft-11.3.3.41 | 147.4 MB | ########5 | 86%  2025-05-07T20:25:32.2953072Z 2025-05-07T20:25:32.2953078Z 2025-05-07T20:25:32.2955368Z 2025-05-07T20:25:32.3078014Z libcusolver-11.7.2.5 | 156.9 MB | #########1 | 92%  2025-05-07T20:25:32.3078319Z 2025-05-07T20:25:32.3078323Z 2025-05-07T20:25:32.3284281Z libcusparse-12.5.7.5 | 164.9 MB | ######### | 90%  2025-05-07T20:25:32.3284598Z 2025-05-07T20:25:32.3310977Z nsight-compute-2025. | 320.6 MB | ####4 | 44%  2025-05-07T20:25:32.3952558Z libcublas-12.8.3.14 | 460.2 MB | ### | 31% 2025-05-07T20:25:32.3952863Z 2025-05-07T20:25:32.3952869Z 2025-05-07T20:25:32.3954808Z 2025-05-07T20:25:32.3971741Z libcusolver-11.7.2.5 | 156.9 MB | #########3 | 94%  2025-05-07T20:25:32.3972046Z 2025-05-07T20:25:32.3972051Z 2025-05-07T20:25:32.3972055Z 2025-05-07T20:25:32.3972059Z 2025-05-07T20:25:32.4150626Z libcufft-11.3.3.41 | 147.4 MB | ########7 | 88%  2025-05-07T20:25:32.4151209Z 2025-05-07T20:25:32.4152995Z 2025-05-07T20:25:32.4293588Z libcusparse-12.5.7.5 | 164.9 MB | #########2 | 92%  2025-05-07T20:25:32.4293971Z 2025-05-07T20:25:32.4312206Z nsight-compute-2025. 
| 320.6 MB | ####5 | 45%  2025-05-07T20:25:32.4979753Z libcublas-12.8.3.14 | 460.2 MB | ###1 | 32% 2025-05-07T20:25:32.4980091Z 2025-05-07T20:25:32.4980097Z 2025-05-07T20:25:32.4982671Z 2025-05-07T20:25:32.4991964Z libcusolver-11.7.2.5 | 156.9 MB | #########5 | 96%  2025-05-07T20:25:32.4992294Z 2025-05-07T20:25:32.4992299Z 2025-05-07T20:25:32.4992303Z 2025-05-07T20:25:32.4992307Z 2025-05-07T20:25:32.5237910Z libcufft-11.3.3.41 | 147.4 MB | ########9 | 90%  2025-05-07T20:25:32.5238204Z 2025-05-07T20:25:32.5238209Z 2025-05-07T20:25:32.5298165Z libcusparse-12.5.7.5 | 164.9 MB | #########4 | 94%  2025-05-07T20:25:32.5298547Z 2025-05-07T20:25:32.5342731Z nsight-compute-2025. | 320.6 MB | ####6 | 46%  2025-05-07T20:25:32.5993704Z libcublas-12.8.3.14 | 460.2 MB | ###2 | 33% 2025-05-07T20:25:32.5993979Z 2025-05-07T20:25:32.5993992Z 2025-05-07T20:25:32.5993996Z 2025-05-07T20:25:32.5995392Z 2025-05-07T20:25:32.6107799Z libcufft-11.3.3.41 | 147.4 MB | #########1 | 92%  2025-05-07T20:25:32.6108099Z 2025-05-07T20:25:32.6108109Z 2025-05-07T20:25:32.6108113Z 2025-05-07T20:25:32.6373644Z libcusolver-11.7.2.5 | 156.9 MB | #########7 | 97%  2025-05-07T20:25:32.6374073Z 2025-05-07T20:25:32.6374702Z 2025-05-07T20:25:32.6427753Z libcusparse-12.5.7.5 | 164.9 MB | #########6 | 96%  2025-05-07T20:25:32.6428125Z 2025-05-07T20:25:32.6436234Z nsight-compute-2025. | 320.6 MB | ####7 | 47%  2025-05-07T20:25:32.6994713Z libcublas-12.8.3.14 | 460.2 MB | ###3 | 33% 2025-05-07T20:25:32.6995094Z 2025-05-07T20:25:32.6995101Z 2025-05-07T20:25:32.6995108Z 2025-05-07T20:25:32.6995113Z 2025-05-07T20:25:32.7114014Z libcufft-11.3.3.41 | 147.4 MB | #########4 | 94%  2025-05-07T20:25:32.7114348Z 2025-05-07T20:25:32.7114352Z 2025-05-07T20:25:32.7114356Z 2025-05-07T20:25:32.7399045Z libcusolver-11.7.2.5 | 156.9 MB | #########9 | 99%  2025-05-07T20:25:32.7399508Z 2025-05-07T20:25:32.7400113Z 2025-05-07T20:25:32.7438405Z libcusparse-12.5.7.5 | 164.9 MB | #########8 | 98%  2025-05-07T20:25:32.7459296Z libcublas-12.8.3.14 | 460.2 MB | ###4 | 34% 2025-05-07T20:25:32.7461009Z 2025-05-07T20:25:32.7996999Z nsight-compute-2025. | 320.6 MB | ####8 | 48%  2025-05-07T20:25:32.7997393Z 2025-05-07T20:25:32.7997400Z 2025-05-07T20:25:32.7997406Z 2025-05-07T20:25:32.7997411Z 2025-05-07T20:25:32.8461802Z libcufft-11.3.3.41 | 147.4 MB | #########6 | 97%  2025-05-07T20:25:32.8463060Z 2025-05-07T20:25:32.8490938Z nsight-compute-2025. | 320.6 MB | ####9 | 50%  2025-05-07T20:25:32.8998084Z libcublas-12.8.3.14 | 460.2 MB | ###4 | 35% 2025-05-07T20:25:32.8998373Z 2025-05-07T20:25:32.8998377Z 2025-05-07T20:25:32.8998411Z 2025-05-07T20:25:32.8998424Z 2025-05-07T20:25:32.9464274Z libcufft-11.3.3.41 | 147.4 MB | #########9 | 99%  2025-05-07T20:25:32.9465895Z 2025-05-07T20:25:32.9492917Z nsight-compute-2025. | 320.6 MB | ##### | 51%  2025-05-07T20:25:33.0465621Z libcublas-12.8.3.14 | 460.2 MB | ###5 | 36% 2025-05-07T20:25:33.0468469Z 2025-05-07T20:25:33.0495395Z nsight-compute-2025. | 320.6 MB | #####2 | 52%  2025-05-07T20:25:33.1467767Z libcublas-12.8.3.14 | 460.2 MB | ###7 | 37% 2025-05-07T20:25:33.1468192Z 2025-05-07T20:25:33.1892283Z nsight-compute-2025. | 320.6 MB | #####4 | 54%  2025-05-07T20:25:33.2468382Z libcublas-12.8.3.14 | 460.2 MB | ###8 | 38% 2025-05-07T20:25:33.2468645Z 2025-05-07T20:25:33.2950043Z nsight-compute-2025. | 320.6 MB | #####5 | 56%  2025-05-07T20:25:33.3597718Z libcublas-12.8.3.14 | 460.2 MB | ###8 | 39% 2025-05-07T20:25:33.3598000Z 2025-05-07T20:25:33.3959970Z nsight-compute-2025. 
| 320.6 MB | #####7 | 57%  2025-05-07T20:25:33.4714139Z libcublas-12.8.3.14 | 460.2 MB | ###9 | 40% 2025-05-07T20:25:33.4714901Z 2025-05-07T20:25:33.4961642Z nsight-compute-2025. | 320.6 MB | #####8 | 58%  2025-05-07T20:25:33.5756629Z libcublas-12.8.3.14 | 460.2 MB | #### | 41% 2025-05-07T20:25:33.5757539Z 2025-05-07T20:25:33.5965073Z nsight-compute-2025. | 320.6 MB | #####9 | 60%  2025-05-07T20:25:33.6850921Z libcublas-12.8.3.14 | 460.2 MB | ####1 | 41% 2025-05-07T20:25:33.6851412Z 2025-05-07T20:25:33.6965752Z nsight-compute-2025. | 320.6 MB | ######1 | 61%  2025-05-07T20:25:33.7852263Z libcublas-12.8.3.14 | 460.2 MB | ####2 | 42% 2025-05-07T20:25:33.7853665Z 2025-05-07T20:25:33.8027190Z nsight-compute-2025. | 320.6 MB | ######2 | 63%  2025-05-07T20:25:33.8853327Z libcublas-12.8.3.14 | 460.2 MB | ####3 | 43% 2025-05-07T20:25:33.8855002Z 2025-05-07T20:25:33.9222880Z nsight-compute-2025. | 320.6 MB | ######3 | 64%  2025-05-07T20:25:33.9857017Z libcublas-12.8.3.14 | 460.2 MB | ####4 | 44% 2025-05-07T20:25:33.9858673Z 2025-05-07T20:25:34.0223453Z nsight-compute-2025. | 320.6 MB | ######5 | 66%  2025-05-07T20:25:34.0893119Z libcublas-12.8.3.14 | 460.2 MB | ####5 | 45% 2025-05-07T20:25:34.0893499Z 2025-05-07T20:25:34.1226534Z nsight-compute-2025. | 320.6 MB | ######7 | 67%  2025-05-07T20:25:34.1926831Z libcublas-12.8.3.14 | 460.2 MB | ####5 | 46% 2025-05-07T20:25:34.1927170Z 2025-05-07T20:25:34.2434860Z nsight-compute-2025. | 320.6 MB | ######8 | 69%  2025-05-07T20:25:34.2928275Z libcublas-12.8.3.14 | 460.2 MB | ####6 | 47% 2025-05-07T20:25:34.2928548Z 2025-05-07T20:25:34.3436778Z nsight-compute-2025. | 320.6 MB | ####### | 70%  2025-05-07T20:25:34.3987148Z libcublas-12.8.3.14 | 460.2 MB | ####7 | 48% 2025-05-07T20:25:34.3987425Z 2025-05-07T20:25:34.4661927Z nsight-compute-2025. | 320.6 MB | #######2 | 72%  2025-05-07T20:25:34.4987467Z libcublas-12.8.3.14 | 460.2 MB | ####8 | 49% 2025-05-07T20:25:34.4989454Z 2025-05-07T20:25:34.5663356Z nsight-compute-2025. | 320.6 MB | #######3 | 74%  2025-05-07T20:25:34.5992050Z libcublas-12.8.3.14 | 460.2 MB | ####9 | 50% 2025-05-07T20:25:34.5992357Z 2025-05-07T20:25:34.6663582Z nsight-compute-2025. | 320.6 MB | #######5 | 75%  2025-05-07T20:25:34.7014443Z libcublas-12.8.3.14 | 460.2 MB | ##### | 51% 2025-05-07T20:25:34.7014710Z 2025-05-07T20:25:34.7665544Z nsight-compute-2025. | 320.6 MB | #######6 | 77%  2025-05-07T20:25:34.8133162Z libcublas-12.8.3.14 | 460.2 MB | #####1 | 52% 2025-05-07T20:25:34.8133821Z 2025-05-07T20:25:34.8669551Z nsight-compute-2025. | 320.6 MB | #######8 | 79%  2025-05-07T20:25:34.9228517Z libcublas-12.8.3.14 | 460.2 MB | #####2 | 53% 2025-05-07T20:25:34.9228855Z 2025-05-07T20:25:34.9670342Z nsight-compute-2025. | 320.6 MB | ######## | 80%  2025-05-07T20:25:35.0228955Z libcublas-12.8.3.14 | 460.2 MB | #####3 | 54% 2025-05-07T20:25:35.0229260Z 2025-05-07T20:25:35.1247406Z nsight-compute-2025. | 320.6 MB | ########1 | 82%  2025-05-07T20:25:35.1248329Z 2025-05-07T20:25:35.1304804Z nsight-compute-2025. | 320.6 MB | ########3 | 83%  2025-05-07T20:25:35.2247226Z libcublas-12.8.3.14 | 460.2 MB | #####4 | 55% 2025-05-07T20:25:35.2247509Z 2025-05-07T20:25:35.2336123Z nsight-compute-2025. | 320.6 MB | ########5 | 85%  2025-05-07T20:25:35.3312811Z libcublas-12.8.3.14 | 460.2 MB | #####5 | 56% 2025-05-07T20:25:35.3313119Z 2025-05-07T20:25:35.3336672Z nsight-compute-2025. 
| 320.6 MB | ########6 | 87%  2025-05-07T20:25:35.4328714Z libcublas-12.8.3.14 | 460.2 MB | #####6 | 57% 2025-05-07T20:25:35.4328996Z 2025-05-07T20:25:35.4346172Z nsight-compute-2025. | 320.6 MB | ########8 | 88%  2025-05-07T20:25:35.4821393Z libcublas-12.8.3.14 | 460.2 MB | #####7 | 58% 2025-05-07T20:25:35.4821759Z 2025-05-07T20:25:35.4821764Z 2025-05-07T20:25:35.4821768Z 2025-05-07T20:25:35.5085215Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%  2025-05-07T20:25:35.5085535Z 2025-05-07T20:25:35.5085539Z 2025-05-07T20:25:35.5085543Z 2025-05-07T20:25:35.5089080Z 2025-05-07T20:25:35.5385882Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:25:35.5393942Z libcublas-12.8.3.14 | 460.2 MB | #####8 | 59% 2025-05-07T20:25:35.5394247Z 2025-05-07T20:25:35.5394253Z 2025-05-07T20:25:35.5394258Z 2025-05-07T20:25:35.5394263Z 2025-05-07T20:25:35.5402504Z 2025-05-07T20:25:35.5697984Z libnpp-12.3.3.65 | 130.6 MB | | 0%  2025-05-07T20:25:35.5698369Z 2025-05-07T20:25:35.5739326Z nsight-compute-2025. | 320.6 MB | ######### | 90%  2025-05-07T20:25:35.5739684Z 2025-05-07T20:25:35.5739689Z 2025-05-07T20:25:35.5739693Z 2025-05-07T20:25:35.5739696Z 2025-05-07T20:25:35.5739700Z 2025-05-07T20:25:35.5740796Z 2025-05-07T20:25:35.6398314Z cuda-nsight-12.8.55 | 113.2 MB | | 0%  2025-05-07T20:25:35.6399128Z 2025-05-07T20:25:35.6399134Z 2025-05-07T20:25:35.6399138Z 2025-05-07T20:25:35.6399142Z 2025-05-07T20:25:35.6399156Z 2025-05-07T20:25:35.6741379Z libnpp-12.3.3.65 | 130.6 MB | 2 | 2%  2025-05-07T20:25:35.6741821Z 2025-05-07T20:25:35.6741827Z 2025-05-07T20:25:35.6741833Z 2025-05-07T20:25:35.6741839Z 2025-05-07T20:25:35.6741858Z 2025-05-07T20:25:35.6742709Z 2025-05-07T20:25:35.6862316Z cuda-nsight-12.8.55 | 113.2 MB | 2 | 3%  2025-05-07T20:25:35.7371269Z libcublas-12.8.3.14 | 460.2 MB | #####9 | 60% 2025-05-07T20:25:35.7372301Z 2025-05-07T20:25:35.7406079Z nsight-compute-2025. | 320.6 MB | #########1 | 92%  2025-05-07T20:25:35.7406366Z 2025-05-07T20:25:35.7406371Z 2025-05-07T20:25:35.7406374Z 2025-05-07T20:25:35.7406378Z 2025-05-07T20:25:35.7408944Z 2025-05-07T20:25:35.7743696Z libnpp-12.3.3.65 | 130.6 MB | 4 | 5%  2025-05-07T20:25:35.7744023Z 2025-05-07T20:25:35.7744064Z 2025-05-07T20:25:35.7744086Z 2025-05-07T20:25:35.7744092Z 2025-05-07T20:25:35.7744097Z 2025-05-07T20:25:35.7745691Z 2025-05-07T20:25:35.8249008Z cuda-nsight-12.8.55 | 113.2 MB | 5 | 5%  2025-05-07T20:25:35.8421112Z libcublas-12.8.3.14 | 460.2 MB | ###### | 61% 2025-05-07T20:25:35.8421480Z 2025-05-07T20:25:35.8421635Z 2025-05-07T20:25:35.8421642Z 2025-05-07T20:25:35.8421647Z 2025-05-07T20:25:35.8421990Z 2025-05-07T20:25:35.8747675Z libnpp-12.3.3.65 | 130.6 MB | 6 | 7%  2025-05-07T20:25:35.8748095Z 2025-05-07T20:25:35.8748101Z 2025-05-07T20:25:35.8748106Z 2025-05-07T20:25:35.8748111Z 2025-05-07T20:25:35.8748116Z 2025-05-07T20:25:35.8750717Z 2025-05-07T20:25:35.8774323Z cuda-nsight-12.8.55 | 113.2 MB | 7 | 8%  2025-05-07T20:25:35.8780814Z 2025-05-07T20:25:35.9043479Z nsight-compute-2025. 
| 320.6 MB | #########2 | 93%  2025-05-07T20:25:35.9043857Z 2025-05-07T20:25:35.9043900Z 2025-05-07T20:25:35.9425070Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%  2025-05-07T20:25:35.9425474Z 2025-05-07T20:25:35.9425480Z 2025-05-07T20:25:35.9425485Z 2025-05-07T20:25:35.9425490Z 2025-05-07T20:25:35.9425504Z 2025-05-07T20:25:35.9551370Z libnpp-12.3.3.65 | 130.6 MB | 8 | 9%  2025-05-07T20:25:35.9551787Z 2025-05-07T20:25:35.9551794Z 2025-05-07T20:25:35.9551799Z 2025-05-07T20:25:35.9551804Z 2025-05-07T20:25:35.9551824Z 2025-05-07T20:25:35.9551830Z 2025-05-07T20:25:35.9561669Z 2025-05-07T20:25:35.9633405Z cuda-nvvp-12.8.57 | 112.4 MB | | 0%  2025-05-07T20:25:36.0197807Z libcublas-12.8.3.14 | 460.2 MB | ######1 | 61% 2025-05-07T20:25:36.0198141Z 2025-05-07T20:25:36.0198148Z 2025-05-07T20:25:36.0198165Z 2025-05-07T20:25:36.0198170Z 2025-05-07T20:25:36.0198175Z 2025-05-07T20:25:36.0198180Z 2025-05-07T20:25:36.0366176Z cuda-nsight-12.8.55 | 113.2 MB | # | 10%  2025-05-07T20:25:36.0372403Z 2025-05-07T20:25:36.0436663Z nsight-compute-2025. | 320.6 MB | #########4 | 94%  2025-05-07T20:25:36.0436948Z 2025-05-07T20:25:36.0436952Z 2025-05-07T20:25:36.0436956Z 2025-05-07T20:25:36.0436959Z 2025-05-07T20:25:36.0438424Z 2025-05-07T20:25:36.0553804Z libnpp-12.3.3.65 | 130.6 MB | # | 11%  2025-05-07T20:25:36.0554199Z 2025-05-07T20:25:36.0554212Z 2025-05-07T20:25:36.0554216Z 2025-05-07T20:25:36.0554220Z 2025-05-07T20:25:36.0554223Z 2025-05-07T20:25:36.0554227Z 2025-05-07T20:25:36.0557947Z 2025-05-07T20:25:36.0839996Z cuda-nvvp-12.8.57 | 112.4 MB | 2 | 2%  2025-05-07T20:25:36.1199371Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 62% 2025-05-07T20:25:36.1199661Z 2025-05-07T20:25:36.1199665Z 2025-05-07T20:25:36.1199669Z 2025-05-07T20:25:36.1199682Z 2025-05-07T20:25:36.1199686Z 2025-05-07T20:25:36.1205454Z 2025-05-07T20:25:36.1537505Z cuda-nsight-12.8.55 | 113.2 MB | #2 | 12%  2025-05-07T20:25:36.1538338Z 2025-05-07T20:25:36.1538346Z 2025-05-07T20:25:36.1538362Z 2025-05-07T20:25:36.1538368Z 2025-05-07T20:25:36.1540932Z 2025-05-07T20:25:36.1560610Z libnpp-12.3.3.65 | 130.6 MB | #2 | 13%  2025-05-07T20:25:36.1561009Z 2025-05-07T20:25:36.1561025Z 2025-05-07T20:25:36.1561030Z 2025-05-07T20:25:36.1561035Z 2025-05-07T20:25:36.1561040Z 2025-05-07T20:25:36.1561045Z 2025-05-07T20:25:36.1571111Z 2025-05-07T20:25:36.1852202Z cuda-nvvp-12.8.57 | 112.4 MB | 4 | 4%  2025-05-07T20:25:36.1854020Z 2025-05-07T20:25:36.2182158Z nsight-compute-2025. | 320.6 MB | #########5 | 95%  2025-05-07T20:25:36.2202366Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 63% 2025-05-07T20:25:36.2202694Z 2025-05-07T20:25:36.2202699Z 2025-05-07T20:25:36.2202703Z 2025-05-07T20:25:36.2202707Z 2025-05-07T20:25:36.2202711Z 2025-05-07T20:25:36.2202715Z 2025-05-07T20:25:36.2564365Z cuda-nsight-12.8.55 | 113.2 MB | #4 | 14%  2025-05-07T20:25:36.2564842Z 2025-05-07T20:25:36.2564848Z 2025-05-07T20:25:36.2564853Z 2025-05-07T20:25:36.2564859Z 2025-05-07T20:25:36.2564864Z 2025-05-07T20:25:36.2564869Z 2025-05-07T20:25:36.2565002Z 2025-05-07T20:25:36.2595828Z cuda-nvvp-12.8.57 | 112.4 MB | 6 | 7%  2025-05-07T20:25:36.2596164Z 2025-05-07T20:25:36.2596170Z 2025-05-07T20:25:36.2596181Z 2025-05-07T20:25:36.2596186Z 2025-05-07T20:25:36.2596191Z 2025-05-07T20:25:36.3066463Z libnpp-12.3.3.65 | 130.6 MB | #4 | 15%  2025-05-07T20:25:36.3067613Z 2025-05-07T20:25:36.3228789Z nsight-compute-2025. 
| 320.6 MB | #########6 | 96%  2025-05-07T20:25:36.3229077Z 2025-05-07T20:25:36.3229083Z 2025-05-07T20:25:36.3229088Z 2025-05-07T20:25:36.3229093Z 2025-05-07T20:25:36.3229098Z 2025-05-07T20:25:36.3229101Z 2025-05-07T20:25:36.3388254Z cuda-nsight-12.8.55 | 113.2 MB | #6 | 16%  2025-05-07T20:25:36.3565188Z libcublas-12.8.3.14 | 460.2 MB | ######3 | 63% 2025-05-07T20:25:36.3565577Z 2025-05-07T20:25:36.3565583Z 2025-05-07T20:25:36.3565589Z 2025-05-07T20:25:36.3565594Z 2025-05-07T20:25:36.3565599Z 2025-05-07T20:25:36.3565603Z 2025-05-07T20:25:36.3567881Z 2025-05-07T20:25:36.3609182Z cuda-nvvp-12.8.57 | 112.4 MB | 8 | 9%  2025-05-07T20:25:36.3609591Z 2025-05-07T20:25:36.3609596Z 2025-05-07T20:25:36.3609600Z 2025-05-07T20:25:36.3609605Z 2025-05-07T20:25:36.3609609Z 2025-05-07T20:25:36.4249348Z libnpp-12.3.3.65 | 130.6 MB | #6 | 17%  2025-05-07T20:25:36.4249674Z 2025-05-07T20:25:36.4249679Z 2025-05-07T20:25:36.4249682Z 2025-05-07T20:25:36.4249686Z 2025-05-07T20:25:36.4249690Z 2025-05-07T20:25:36.4251299Z 2025-05-07T20:25:36.4273190Z cuda-nsight-12.8.55 | 113.2 MB | #8 | 18%  2025-05-07T20:25:36.4273626Z 2025-05-07T20:25:36.4566516Z nsight-compute-2025. | 320.6 MB | #########6 | 97%  2025-05-07T20:25:36.4566846Z 2025-05-07T20:25:36.4567079Z 2025-05-07T20:25:36.4567093Z 2025-05-07T20:25:36.4567097Z 2025-05-07T20:25:36.4567101Z 2025-05-07T20:25:36.4567105Z 2025-05-07T20:25:36.4567894Z 2025-05-07T20:25:36.4607805Z cuda-nvvp-12.8.57 | 112.4 MB | # | 11%  2025-05-07T20:25:36.4612135Z libcublas-12.8.3.14 | 460.2 MB | ######3 | 64% 2025-05-07T20:25:36.4612486Z 2025-05-07T20:25:36.4612493Z 2025-05-07T20:25:36.4612498Z 2025-05-07T20:25:36.4612503Z 2025-05-07T20:25:36.4612509Z 2025-05-07T20:25:36.5250433Z libnpp-12.3.3.65 | 130.6 MB | #8 | 19%  2025-05-07T20:25:36.5250747Z 2025-05-07T20:25:36.5250751Z 2025-05-07T20:25:36.5250755Z 2025-05-07T20:25:36.5250759Z 2025-05-07T20:25:36.5250763Z 2025-05-07T20:25:36.5251409Z 2025-05-07T20:25:36.5431283Z cuda-nsight-12.8.55 | 113.2 MB | ## | 20%  2025-05-07T20:25:36.5436313Z 2025-05-07T20:25:36.5566314Z nsight-compute-2025. | 320.6 MB | #########7 | 98%  2025-05-07T20:25:36.5566950Z 2025-05-07T20:25:36.5566957Z 2025-05-07T20:25:36.5566960Z 2025-05-07T20:25:36.5566964Z 2025-05-07T20:25:36.5566968Z 2025-05-07T20:25:36.5566971Z 2025-05-07T20:25:36.5567641Z 2025-05-07T20:25:36.5654135Z cuda-nvvp-12.8.57 | 112.4 MB | #2 | 13%  2025-05-07T20:25:36.5689318Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 64% 2025-05-07T20:25:36.5689676Z 2025-05-07T20:25:36.5689683Z 2025-05-07T20:25:36.5689688Z 2025-05-07T20:25:36.5689693Z 2025-05-07T20:25:36.5689699Z 2025-05-07T20:25:36.6250415Z libnpp-12.3.3.65 | 130.6 MB | ## | 21%  2025-05-07T20:25:36.6250729Z 2025-05-07T20:25:36.6250733Z 2025-05-07T20:25:36.6250738Z 2025-05-07T20:25:36.6250742Z 2025-05-07T20:25:36.6250747Z 2025-05-07T20:25:36.6250765Z 2025-05-07T20:25:36.6645097Z cuda-nsight-12.8.55 | 113.2 MB | ##2 | 22%  2025-05-07T20:25:36.6645425Z 2025-05-07T20:25:36.6645429Z 2025-05-07T20:25:36.6645433Z 2025-05-07T20:25:36.6645477Z 2025-05-07T20:25:36.6645494Z 2025-05-07T20:25:36.6645498Z 2025-05-07T20:25:36.6651242Z 2025-05-07T20:25:36.6670798Z cuda-nvvp-12.8.57 | 112.4 MB | #4 | 15%  2025-05-07T20:25:36.6671092Z 2025-05-07T20:25:36.6758398Z nsight-compute-2025. 
| 320.6 MB | #########8 | 99%  2025-05-07T20:25:36.6828610Z libcublas-12.8.3.14 | 460.2 MB | ######5 | 65% 2025-05-07T20:25:36.6828862Z 2025-05-07T20:25:36.6828866Z 2025-05-07T20:25:36.6828870Z 2025-05-07T20:25:36.6828874Z 2025-05-07T20:25:36.6831384Z 2025-05-07T20:25:36.7263560Z libnpp-12.3.3.65 | 130.6 MB | ##2 | 23%  2025-05-07T20:25:36.7263989Z 2025-05-07T20:25:36.7263994Z 2025-05-07T20:25:36.7263999Z 2025-05-07T20:25:36.7264005Z 2025-05-07T20:25:36.7264011Z 2025-05-07T20:25:36.7267807Z 2025-05-07T20:25:36.7651041Z cuda-nsight-12.8.55 | 113.2 MB | ##4 | 24%  2025-05-07T20:25:36.7651381Z 2025-05-07T20:25:36.7651385Z 2025-05-07T20:25:36.7651421Z 2025-05-07T20:25:36.7651435Z 2025-05-07T20:25:36.7651439Z 2025-05-07T20:25:36.7651444Z 2025-05-07T20:25:36.7659552Z 2025-05-07T20:25:36.7749434Z cuda-nvvp-12.8.57 | 112.4 MB | #6 | 17%  2025-05-07T20:25:36.7750017Z 2025-05-07T20:25:36.7840175Z nsight-compute-2025. | 320.6 MB | #########9 | 99%  2025-05-07T20:25:36.7860172Z libcublas-12.8.3.14 | 460.2 MB | ######5 | 66% 2025-05-07T20:25:36.7860542Z 2025-05-07T20:25:36.7860549Z 2025-05-07T20:25:36.7860554Z 2025-05-07T20:25:36.7860559Z 2025-05-07T20:25:36.7862353Z 2025-05-07T20:25:36.8265364Z libnpp-12.3.3.65 | 130.6 MB | ##4 | 24%  2025-05-07T20:25:36.8265761Z 2025-05-07T20:25:36.8265766Z 2025-05-07T20:25:36.8265772Z 2025-05-07T20:25:36.8265778Z 2025-05-07T20:25:36.8265783Z 2025-05-07T20:25:36.8265788Z 2025-05-07T20:25:36.8669415Z cuda-nsight-12.8.55 | 113.2 MB | ##6 | 27%  2025-05-07T20:25:36.8669952Z 2025-05-07T20:25:36.8669987Z 2025-05-07T20:25:36.8670261Z 2025-05-07T20:25:36.8670268Z 2025-05-07T20:25:36.8670273Z 2025-05-07T20:25:36.8670278Z 2025-05-07T20:25:36.8670698Z 2025-05-07T20:25:36.8868574Z cuda-nvvp-12.8.57 | 112.4 MB | #8 | 19%  2025-05-07T20:25:36.8868993Z 2025-05-07T20:25:36.8869000Z 2025-05-07T20:25:36.8869005Z 2025-05-07T20:25:36.8869010Z 2025-05-07T20:25:36.8869862Z 2025-05-07T20:25:36.9265958Z libnpp-12.3.3.65 | 130.6 MB | ##6 | 26%  2025-05-07T20:25:36.9266365Z 2025-05-07T20:25:36.9266370Z 2025-05-07T20:25:36.9266376Z 2025-05-07T20:25:36.9266381Z 2025-05-07T20:25:36.9266386Z 2025-05-07T20:25:36.9269048Z 2025-05-07T20:25:36.9709870Z cuda-nsight-12.8.55 | 113.2 MB | ##9 | 29%  2025-05-07T20:25:36.9870257Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 66% 2025-05-07T20:25:36.9870630Z 2025-05-07T20:25:36.9870636Z 2025-05-07T20:25:36.9870641Z 2025-05-07T20:25:36.9870646Z 2025-05-07T20:25:36.9874678Z 2025-05-07T20:25:37.0266567Z libnpp-12.3.3.65 | 130.6 MB | ##8 | 29%  2025-05-07T20:25:37.0266982Z 2025-05-07T20:25:37.0266988Z 2025-05-07T20:25:37.0266993Z 2025-05-07T20:25:37.0266998Z 2025-05-07T20:25:37.0267004Z 2025-05-07T20:25:37.0275968Z 2025-05-07T20:25:37.0284663Z cuda-nsight-12.8.55 | 113.2 MB | ###2 | 32%  2025-05-07T20:25:37.0285088Z 2025-05-07T20:25:37.0285094Z 2025-05-07T20:25:37.0285099Z 2025-05-07T20:25:37.0285104Z 2025-05-07T20:25:37.0285109Z 2025-05-07T20:25:37.0285114Z 2025-05-07T20:25:37.0285119Z 2025-05-07T20:25:37.0711891Z cuda-nvvp-12.8.57 | 112.4 MB | ## | 21%  2025-05-07T20:25:37.0965440Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 67% 2025-05-07T20:25:37.0965817Z 2025-05-07T20:25:37.0965823Z 2025-05-07T20:25:37.0965828Z 2025-05-07T20:25:37.0965833Z 2025-05-07T20:25:37.0965838Z 2025-05-07T20:25:37.1287260Z libnpp-12.3.3.65 | 130.6 MB | ### | 31%  2025-05-07T20:25:37.1287606Z 2025-05-07T20:25:37.1287611Z 2025-05-07T20:25:37.1287617Z 2025-05-07T20:25:37.1287622Z 2025-05-07T20:25:37.1287627Z 2025-05-07T20:25:37.1287632Z 2025-05-07T20:25:37.1287644Z 
2025-05-07T20:25:37.1298490Z cuda-nvvp-12.8.57 | 112.4 MB | ##2 | 23%  2025-05-07T20:25:37.1298861Z 2025-05-07T20:25:37.1298867Z 2025-05-07T20:25:37.1298872Z 2025-05-07T20:25:37.1298877Z 2025-05-07T20:25:37.1298894Z 2025-05-07T20:25:37.1298900Z 2025-05-07T20:25:37.1721210Z cuda-nsight-12.8.55 | 113.2 MB | ###4 | 34%  2025-05-07T20:25:37.1985124Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 67% 2025-05-07T20:25:37.1985405Z 2025-05-07T20:25:37.1985411Z 2025-05-07T20:25:37.1985415Z 2025-05-07T20:25:37.1985418Z 2025-05-07T20:25:37.1985422Z 2025-05-07T20:25:37.2288428Z libnpp-12.3.3.65 | 130.6 MB | ###2 | 33%  2025-05-07T20:25:37.2288737Z 2025-05-07T20:25:37.2288742Z 2025-05-07T20:25:37.2288745Z 2025-05-07T20:25:37.2288773Z 2025-05-07T20:25:37.2288787Z 2025-05-07T20:25:37.2288791Z 2025-05-07T20:25:37.2289911Z 2025-05-07T20:25:37.2379062Z cuda-nvvp-12.8.57 | 112.4 MB | ##4 | 25%  2025-05-07T20:25:37.2379370Z 2025-05-07T20:25:37.2379375Z 2025-05-07T20:25:37.2379379Z 2025-05-07T20:25:37.2379382Z 2025-05-07T20:25:37.2379386Z 2025-05-07T20:25:37.2379400Z 2025-05-07T20:25:37.2727257Z cuda-nsight-12.8.55 | 113.2 MB | ###6 | 37%  2025-05-07T20:25:37.3039999Z libcublas-12.8.3.14 | 460.2 MB | ######7 | 68% 2025-05-07T20:25:37.3040301Z 2025-05-07T20:25:37.3040306Z 2025-05-07T20:25:37.3040310Z 2025-05-07T20:25:37.3040313Z 2025-05-07T20:25:37.3041442Z 2025-05-07T20:25:37.3291701Z libnpp-12.3.3.65 | 130.6 MB | ###4 | 34%  2025-05-07T20:25:37.3292007Z 2025-05-07T20:25:37.3292012Z 2025-05-07T20:25:37.3292015Z 2025-05-07T20:25:37.3292019Z 2025-05-07T20:25:37.3292023Z 2025-05-07T20:25:37.3292026Z 2025-05-07T20:25:37.3294590Z 2025-05-07T20:25:37.3442878Z cuda-nvvp-12.8.57 | 112.4 MB | ##7 | 27%  2025-05-07T20:25:37.3443251Z 2025-05-07T20:25:37.3443257Z 2025-05-07T20:25:37.3443262Z 2025-05-07T20:25:37.3443266Z 2025-05-07T20:25:37.3443271Z 2025-05-07T20:25:37.3450000Z 2025-05-07T20:25:37.3727549Z cuda-nsight-12.8.55 | 113.2 MB | ###9 | 39%  2025-05-07T20:25:37.4040754Z libcublas-12.8.3.14 | 460.2 MB | ######8 | 68% 2025-05-07T20:25:37.4041044Z 2025-05-07T20:25:37.4041049Z 2025-05-07T20:25:37.4041053Z 2025-05-07T20:25:37.4041057Z 2025-05-07T20:25:37.4046187Z 2025-05-07T20:25:37.4296701Z libnpp-12.3.3.65 | 130.6 MB | ###6 | 36%  2025-05-07T20:25:37.4297033Z 2025-05-07T20:25:37.4297038Z 2025-05-07T20:25:37.4297042Z 2025-05-07T20:25:37.4297046Z 2025-05-07T20:25:37.4297058Z 2025-05-07T20:25:37.4297063Z 2025-05-07T20:25:37.4302635Z 2025-05-07T20:25:37.4450640Z cuda-nvvp-12.8.57 | 112.4 MB | ##9 | 30%  2025-05-07T20:25:37.4451346Z 2025-05-07T20:25:37.4451363Z 2025-05-07T20:25:37.4451369Z 2025-05-07T20:25:37.4451374Z 2025-05-07T20:25:37.4451379Z 2025-05-07T20:25:37.4451384Z 2025-05-07T20:25:37.4735729Z cuda-nsight-12.8.55 | 113.2 MB | ####1 | 42%  2025-05-07T20:25:37.5061597Z libcublas-12.8.3.14 | 460.2 MB | ######8 | 69% 2025-05-07T20:25:37.5062002Z 2025-05-07T20:25:37.5062008Z 2025-05-07T20:25:37.5062014Z 2025-05-07T20:25:37.5062020Z 2025-05-07T20:25:37.5063526Z 2025-05-07T20:25:37.5338446Z libnpp-12.3.3.65 | 130.6 MB | ###8 | 38%  2025-05-07T20:25:37.5338756Z 2025-05-07T20:25:37.5338760Z 2025-05-07T20:25:37.5338764Z 2025-05-07T20:25:37.5338768Z 2025-05-07T20:25:37.5338772Z 2025-05-07T20:25:37.5338785Z 2025-05-07T20:25:37.5343853Z 2025-05-07T20:25:37.5459467Z cuda-nvvp-12.8.57 | 112.4 MB | ###1 | 32%  2025-05-07T20:25:37.5459773Z 2025-05-07T20:25:37.5459787Z 2025-05-07T20:25:37.5459819Z 2025-05-07T20:25:37.5459833Z 2025-05-07T20:25:37.5459837Z 2025-05-07T20:25:37.5461546Z 2025-05-07T20:25:37.5832363Z cuda-nsight-12.8.55 | 
113.2 MB | ####4 | 44%  2025-05-07T20:25:37.6121420Z libcublas-12.8.3.14 | 460.2 MB | ######9 | 69% 2025-05-07T20:25:37.6121696Z 2025-05-07T20:25:37.6121700Z 2025-05-07T20:25:37.6121704Z 2025-05-07T20:25:37.6121708Z 2025-05-07T20:25:37.6121714Z 2025-05-07T20:25:37.6341493Z libnpp-12.3.3.65 | 130.6 MB | #### | 40%  2025-05-07T20:25:37.6341782Z 2025-05-07T20:25:37.6341786Z 2025-05-07T20:25:37.6341790Z 2025-05-07T20:25:37.6341793Z 2025-05-07T20:25:37.6341800Z 2025-05-07T20:25:37.6341811Z 2025-05-07T20:25:37.6343448Z 2025-05-07T20:25:37.6469207Z cuda-nvvp-12.8.57 | 112.4 MB | ###4 | 34%  2025-05-07T20:25:37.6469562Z 2025-05-07T20:25:37.6469567Z 2025-05-07T20:25:37.6469571Z 2025-05-07T20:25:37.6469575Z 2025-05-07T20:25:37.6469578Z 2025-05-07T20:25:37.6473449Z 2025-05-07T20:25:37.6865155Z cuda-nsight-12.8.55 | 113.2 MB | ####6 | 47%  2025-05-07T20:25:37.7181627Z libcublas-12.8.3.14 | 460.2 MB | ######9 | 70% 2025-05-07T20:25:37.7181918Z 2025-05-07T20:25:37.7181922Z 2025-05-07T20:25:37.7181926Z 2025-05-07T20:25:37.7181930Z 2025-05-07T20:25:37.7181934Z 2025-05-07T20:25:37.7375439Z libnpp-12.3.3.65 | 130.6 MB | ####2 | 42%  2025-05-07T20:25:37.7375734Z 2025-05-07T20:25:37.7375738Z 2025-05-07T20:25:37.7375743Z 2025-05-07T20:25:37.7375747Z 2025-05-07T20:25:37.7375751Z 2025-05-07T20:25:37.7375755Z 2025-05-07T20:25:37.7375759Z 2025-05-07T20:25:37.7489238Z cuda-nvvp-12.8.57 | 112.4 MB | ###6 | 36%  2025-05-07T20:25:37.7489664Z 2025-05-07T20:25:37.7489669Z 2025-05-07T20:25:37.7489674Z 2025-05-07T20:25:37.7489680Z 2025-05-07T20:25:37.7489686Z 2025-05-07T20:25:37.7489691Z 2025-05-07T20:25:37.7892564Z cuda-nsight-12.8.55 | 113.2 MB | ####9 | 49%  2025-05-07T20:25:37.8185093Z libcublas-12.8.3.14 | 460.2 MB | ####### | 70% 2025-05-07T20:25:37.8185472Z 2025-05-07T20:25:37.8185478Z 2025-05-07T20:25:37.8185483Z 2025-05-07T20:25:37.8185488Z 2025-05-07T20:25:37.8185493Z 2025-05-07T20:25:37.8382636Z libnpp-12.3.3.65 | 130.6 MB | ####4 | 44%  2025-05-07T20:25:37.8383218Z 2025-05-07T20:25:37.8383223Z 2025-05-07T20:25:37.8383229Z 2025-05-07T20:25:37.8383234Z 2025-05-07T20:25:37.8383239Z 2025-05-07T20:25:37.8383244Z 2025-05-07T20:25:37.8388234Z 2025-05-07T20:25:37.8534478Z cuda-nvvp-12.8.57 | 112.4 MB | ###8 | 39%  2025-05-07T20:25:37.8534898Z 2025-05-07T20:25:37.8534904Z 2025-05-07T20:25:37.8534909Z 2025-05-07T20:25:37.8534915Z 2025-05-07T20:25:37.8534920Z 2025-05-07T20:25:37.8534925Z 2025-05-07T20:25:37.8926385Z cuda-nsight-12.8.55 | 113.2 MB | #####1 | 51%  2025-05-07T20:25:37.9185544Z libcublas-12.8.3.14 | 460.2 MB | ####### | 71% 2025-05-07T20:25:37.9186163Z 2025-05-07T20:25:37.9186184Z 2025-05-07T20:25:37.9186198Z 2025-05-07T20:25:37.9186203Z 2025-05-07T20:25:37.9186213Z 2025-05-07T20:25:37.9383526Z libnpp-12.3.3.65 | 130.6 MB | ####6 | 46%  2025-05-07T20:25:37.9383912Z 2025-05-07T20:25:37.9383917Z 2025-05-07T20:25:37.9383941Z 2025-05-07T20:25:37.9383946Z 2025-05-07T20:25:37.9383951Z 2025-05-07T20:25:37.9383956Z 2025-05-07T20:25:37.9383962Z 2025-05-07T20:25:37.9574396Z cuda-nvvp-12.8.57 | 112.4 MB | ####1 | 41%  2025-05-07T20:25:37.9574952Z 2025-05-07T20:25:37.9574959Z 2025-05-07T20:25:37.9574964Z 2025-05-07T20:25:37.9574969Z 2025-05-07T20:25:37.9574974Z 2025-05-07T20:25:37.9574980Z 2025-05-07T20:25:37.9926337Z cuda-nsight-12.8.55 | 113.2 MB | #####3 | 54%  2025-05-07T20:25:38.0224613Z libcublas-12.8.3.14 | 460.2 MB | #######1 | 71% 2025-05-07T20:25:38.0224993Z 2025-05-07T20:25:38.0225000Z 2025-05-07T20:25:38.0225005Z 2025-05-07T20:25:38.0225039Z 2025-05-07T20:25:38.0226540Z 2025-05-07T20:25:38.0412037Z 
libnpp-12.3.3.65 | 130.6 MB | ####8 | 48%  2025-05-07T20:25:38.0412866Z 2025-05-07T20:25:38.0412872Z 2025-05-07T20:25:38.0412877Z 2025-05-07T20:25:38.0412891Z 2025-05-07T20:25:38.0412896Z 2025-05-07T20:25:38.0412901Z 2025-05-07T20:25:38.0414022Z 2025-05-07T20:25:38.0579239Z cuda-nvvp-12.8.57 | 112.4 MB | ####3 | 43%  2025-05-07T20:25:38.0579650Z 2025-05-07T20:25:38.0579656Z 2025-05-07T20:25:38.0579661Z 2025-05-07T20:25:38.0579666Z 2025-05-07T20:25:38.0579672Z 2025-05-07T20:25:38.0580814Z 2025-05-07T20:25:38.0927966Z cuda-nsight-12.8.55 | 113.2 MB | #####6 | 56%  2025-05-07T20:25:38.1259428Z libcublas-12.8.3.14 | 460.2 MB | #######1 | 72% 2025-05-07T20:25:38.1259796Z 2025-05-07T20:25:38.1259800Z 2025-05-07T20:25:38.1259805Z 2025-05-07T20:25:38.1259808Z 2025-05-07T20:25:38.1261037Z 2025-05-07T20:25:38.1551556Z libnpp-12.3.3.65 | 130.6 MB | ##### | 50%  2025-05-07T20:25:38.1551998Z 2025-05-07T20:25:38.1552003Z 2025-05-07T20:25:38.1552007Z 2025-05-07T20:25:38.1552010Z 2025-05-07T20:25:38.1552014Z 2025-05-07T20:25:38.1552018Z 2025-05-07T20:25:38.1552022Z 2025-05-07T20:25:38.1608276Z cuda-nvvp-12.8.57 | 112.4 MB | ####5 | 46%  2025-05-07T20:25:38.1608584Z 2025-05-07T20:25:38.1608589Z 2025-05-07T20:25:38.1608592Z 2025-05-07T20:25:38.1608596Z 2025-05-07T20:25:38.1608600Z 2025-05-07T20:25:38.1610950Z 2025-05-07T20:25:38.1934787Z cuda-nsight-12.8.55 | 113.2 MB | #####8 | 59%  2025-05-07T20:25:38.2263177Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 72% 2025-05-07T20:25:38.2263605Z 2025-05-07T20:25:38.2263609Z 2025-05-07T20:25:38.2263613Z 2025-05-07T20:25:38.2263617Z 2025-05-07T20:25:38.2266206Z 2025-05-07T20:25:38.2560247Z libnpp-12.3.3.65 | 130.6 MB | #####2 | 52%  2025-05-07T20:25:38.2560643Z 2025-05-07T20:25:38.2560674Z 2025-05-07T20:25:38.2560940Z 2025-05-07T20:25:38.2560950Z 2025-05-07T20:25:38.2560955Z 2025-05-07T20:25:38.2560959Z 2025-05-07T20:25:38.2564041Z 2025-05-07T20:25:38.2691120Z cuda-nvvp-12.8.57 | 112.4 MB | ####7 | 48%  2025-05-07T20:25:38.2691437Z 2025-05-07T20:25:38.2691441Z 2025-05-07T20:25:38.2691445Z 2025-05-07T20:25:38.2691448Z 2025-05-07T20:25:38.2691452Z 2025-05-07T20:25:38.2691456Z 2025-05-07T20:25:38.2934804Z cuda-nsight-12.8.55 | 113.2 MB | ######1 | 61%  2025-05-07T20:25:38.3271383Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 73% 2025-05-07T20:25:38.3271692Z 2025-05-07T20:25:38.3271696Z 2025-05-07T20:25:38.3271700Z 2025-05-07T20:25:38.3271704Z 2025-05-07T20:25:38.3272894Z 2025-05-07T20:25:38.3564945Z libnpp-12.3.3.65 | 130.6 MB | #####4 | 54%  2025-05-07T20:25:38.3565260Z 2025-05-07T20:25:38.3565264Z 2025-05-07T20:25:38.3565268Z 2025-05-07T20:25:38.3565272Z 2025-05-07T20:25:38.3565276Z 2025-05-07T20:25:38.3565541Z 2025-05-07T20:25:38.3567656Z 2025-05-07T20:25:38.3735375Z cuda-nvvp-12.8.57 | 112.4 MB | ##### | 50%  2025-05-07T20:25:38.3735743Z 2025-05-07T20:25:38.3735749Z 2025-05-07T20:25:38.3735762Z 2025-05-07T20:25:38.3735767Z 2025-05-07T20:25:38.3735772Z 2025-05-07T20:25:38.3737333Z 2025-05-07T20:25:38.4273932Z cuda-nsight-12.8.55 | 113.2 MB | ######3 | 63%  2025-05-07T20:25:38.4274350Z 2025-05-07T20:25:38.4274356Z 2025-05-07T20:25:38.4274361Z 2025-05-07T20:25:38.4274365Z 2025-05-07T20:25:38.4275685Z 2025-05-07T20:25:38.4737613Z libnpp-12.3.3.65 | 130.6 MB | #####6 | 57%  2025-05-07T20:25:38.4737941Z 2025-05-07T20:25:38.4737946Z 2025-05-07T20:25:38.4737950Z 2025-05-07T20:25:38.4737962Z 2025-05-07T20:25:38.4737966Z 2025-05-07T20:25:38.4739187Z 2025-05-07T20:25:38.5060095Z cuda-nsight-12.8.55 | 113.2 MB | ######6 | 66%  2025-05-07T20:25:38.5275816Z libcublas-12.8.3.14 | 
460.2 MB | #######3 | 73% 2025-05-07T20:25:38.5276249Z 2025-05-07T20:25:38.5276254Z 2025-05-07T20:25:38.5276258Z 2025-05-07T20:25:38.5276262Z 2025-05-07T20:25:38.5276265Z 2025-05-07T20:25:38.5400025Z libnpp-12.3.3.65 | 130.6 MB | #####9 | 59%  2025-05-07T20:25:38.5400394Z 2025-05-07T20:25:38.5400398Z 2025-05-07T20:25:38.5400402Z 2025-05-07T20:25:38.5400405Z 2025-05-07T20:25:38.5400409Z 2025-05-07T20:25:38.5400413Z 2025-05-07T20:25:38.5406270Z 2025-05-07T20:25:38.5784278Z cuda-nvvp-12.8.57 | 112.4 MB | #####2 | 53%  2025-05-07T20:25:38.5784639Z 2025-05-07T20:25:38.5784643Z 2025-05-07T20:25:38.5784647Z 2025-05-07T20:25:38.5784651Z 2025-05-07T20:25:38.5784666Z 2025-05-07T20:25:38.5786055Z 2025-05-07T20:25:38.6061582Z cuda-nsight-12.8.55 | 113.2 MB | ######8 | 69%  2025-05-07T20:25:38.6319701Z libcublas-12.8.3.14 | 460.2 MB | #######3 | 74% 2025-05-07T20:25:38.6320051Z 2025-05-07T20:25:38.6320056Z 2025-05-07T20:25:38.6320089Z 2025-05-07T20:25:38.6320105Z 2025-05-07T20:25:38.6321437Z 2025-05-07T20:25:38.6406114Z libnpp-12.3.3.65 | 130.6 MB | ######1 | 61%  2025-05-07T20:25:38.6406512Z 2025-05-07T20:25:38.6406520Z 2025-05-07T20:25:38.6406529Z 2025-05-07T20:25:38.6406538Z 2025-05-07T20:25:38.6406546Z 2025-05-07T20:25:38.6406555Z 2025-05-07T20:25:38.6406563Z 2025-05-07T20:25:38.6784468Z cuda-nvvp-12.8.57 | 112.4 MB | #####5 | 55%  2025-05-07T20:25:38.6784939Z 2025-05-07T20:25:38.6784947Z 2025-05-07T20:25:38.6784953Z 2025-05-07T20:25:38.6784960Z 2025-05-07T20:25:38.6784966Z 2025-05-07T20:25:38.6787405Z 2025-05-07T20:25:38.7061936Z cuda-nsight-12.8.55 | 113.2 MB | #######1 | 71%  2025-05-07T20:25:38.7408834Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 74% 2025-05-07T20:25:38.7409257Z 2025-05-07T20:25:38.7409266Z 2025-05-07T20:25:38.7409273Z 2025-05-07T20:25:38.7409278Z 2025-05-07T20:25:38.7409283Z 2025-05-07T20:25:38.7409321Z 2025-05-07T20:25:38.7409606Z 2025-05-07T20:25:38.7786922Z cuda-nvvp-12.8.57 | 112.4 MB | #####7 | 57%  2025-05-07T20:25:38.7787380Z 2025-05-07T20:25:38.7787384Z 2025-05-07T20:25:38.7787388Z 2025-05-07T20:25:38.7787392Z 2025-05-07T20:25:38.7787395Z 2025-05-07T20:25:38.7787399Z 2025-05-07T20:25:38.8062973Z cuda-nsight-12.8.55 | 113.2 MB | #######3 | 74%  2025-05-07T20:25:38.8305624Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 75% 2025-05-07T20:25:38.8305916Z 2025-05-07T20:25:38.8305920Z 2025-05-07T20:25:38.8305924Z 2025-05-07T20:25:38.8305928Z 2025-05-07T20:25:38.8305933Z 2025-05-07T20:25:38.8410034Z libnpp-12.3.3.65 | 130.6 MB | ######3 | 63%  2025-05-07T20:25:38.8410496Z 2025-05-07T20:25:38.8410503Z 2025-05-07T20:25:38.8410508Z 2025-05-07T20:25:38.8410513Z 2025-05-07T20:25:38.8410518Z 2025-05-07T20:25:38.8410523Z 2025-05-07T20:25:38.8412381Z 2025-05-07T20:25:38.8850667Z cuda-nvvp-12.8.57 | 112.4 MB | ###### | 60%  2025-05-07T20:25:38.8851354Z 2025-05-07T20:25:38.8851359Z 2025-05-07T20:25:38.8851362Z 2025-05-07T20:25:38.8851366Z 2025-05-07T20:25:38.8851370Z 2025-05-07T20:25:38.8852163Z 2025-05-07T20:25:38.9066132Z cuda-nsight-12.8.55 | 113.2 MB | #######6 | 76%  2025-05-07T20:25:38.9312920Z libcublas-12.8.3.14 | 460.2 MB | #######5 | 76% 2025-05-07T20:25:38.9313199Z 2025-05-07T20:25:38.9313381Z 2025-05-07T20:25:38.9313392Z 2025-05-07T20:25:38.9313397Z 2025-05-07T20:25:38.9324816Z 2025-05-07T20:25:38.9411163Z libnpp-12.3.3.65 | 130.6 MB | ######5 | 65%  2025-05-07T20:25:38.9411512Z 2025-05-07T20:25:38.9411518Z 2025-05-07T20:25:38.9411523Z 2025-05-07T20:25:38.9411532Z 2025-05-07T20:25:38.9411537Z 2025-05-07T20:25:38.9411542Z 2025-05-07T20:25:38.9411548Z 
2025-05-07T20:25:43.9229831Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%
2025-05-07T20:25:44.1308462Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%
2025-05-07T20:25:44.4317835Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%
2025-05-07T20:25:44.8461821Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%
2025-05-07T20:25:47.3180464Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%
2025-05-07T20:25:47.3585946Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%
2025-05-07T20:25:47.4587508Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%
2025-05-07T20:25:47.8875706Z nsight-compute-2025. | 320.6 MB | ########## | 100%
2025-05-07T20:25:47.9851161Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%
2025-05-07T20:25:47.9900567Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%
2025-05-07T20:25:49.3524110Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%
2025-05-07T20:25:49.3754827Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%
2025-05-07T20:25:49.8019367Z python-3.9.18 | 22.7 MB | ########## | 100%
2025-05-07T20:25:49.8480656Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%
2025-05-07T20:25:50.2578389Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%
2025-05-07T20:25:50.3582268Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%
2025-05-07T20:25:50.4944547Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%
2025-05-07T20:25:52.3996710Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%
2025-05-07T20:25:53.3812579Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100%
2025-05-07T20:25:55.9475069Z ... (more hidden) ...
2025-05-07T20:26:01.3038439Z 2025-05-07T20:26:01.3038452Z 2025-05-07T20:26:01.3038457Z 2025-05-07T20:26:01.3038462Z 2025-05-07T20:26:01.3038467Z 2025-05-07T20:26:01.3038472Z 2025-05-07T20:26:01.3038477Z 2025-05-07T20:26:01.3038748Z  2025-05-07T20:26:01.3039078Z 2025-05-07T20:26:01.3039083Z 2025-05-07T20:26:01.3039088Z 2025-05-07T20:26:01.3039093Z 2025-05-07T20:26:01.3039098Z 2025-05-07T20:26:01.3039103Z 2025-05-07T20:26:01.3039108Z 2025-05-07T20:26:01.3039113Z 2025-05-07T20:26:01.3039133Z 2025-05-07T20:26:01.3039138Z 2025-05-07T20:26:01.3039143Z 2025-05-07T20:26:01.3039148Z 2025-05-07T20:26:01.3039153Z 2025-05-07T20:26:01.3039158Z 2025-05-07T20:26:01.3039458Z  2025-05-07T20:26:01.3039781Z 2025-05-07T20:26:01.3039794Z 2025-05-07T20:26:01.3039905Z 2025-05-07T20:26:01.3039911Z 2025-05-07T20:26:01.3039916Z 2025-05-07T20:26:01.3039921Z 2025-05-07T20:26:01.3039926Z 2025-05-07T20:26:01.3039931Z 2025-05-07T20:26:01.3039936Z 2025-05-07T20:26:01.3039941Z 2025-05-07T20:26:01.3039946Z 2025-05-07T20:26:01.3039951Z 2025-05-07T20:26:01.3039964Z 2025-05-07T20:26:01.3039969Z 2025-05-07T20:26:01.3039974Z 2025-05-07T20:26:01.3040270Z  2025-05-07T20:26:01.3040604Z 2025-05-07T20:26:01.3040610Z 2025-05-07T20:26:01.3040616Z 2025-05-07T20:26:01.3040621Z 2025-05-07T20:26:01.3040636Z 2025-05-07T20:26:01.3040641Z 2025-05-07T20:26:01.3040646Z 2025-05-07T20:26:01.3040651Z 2025-05-07T20:26:01.3040655Z 2025-05-07T20:26:01.3040660Z 2025-05-07T20:26:01.3040665Z 2025-05-07T20:26:01.3040670Z 2025-05-07T20:26:01.3040675Z 2025-05-07T20:26:01.3040681Z 2025-05-07T20:26:01.3040685Z 2025-05-07T20:26:01.3040691Z 2025-05-07T20:26:01.3041116Z  2025-05-07T20:26:01.3041463Z 2025-05-07T20:26:01.3041468Z 2025-05-07T20:26:01.3041473Z 2025-05-07T20:26:01.3041478Z 2025-05-07T20:26:01.3041483Z 2025-05-07T20:26:01.3041488Z 2025-05-07T20:26:01.3041493Z 2025-05-07T20:26:01.3041498Z 2025-05-07T20:26:01.3041503Z 2025-05-07T20:26:01.3041508Z 2025-05-07T20:26:01.3041513Z 2025-05-07T20:26:01.3041518Z 2025-05-07T20:26:01.3041524Z 2025-05-07T20:26:01.3041529Z 2025-05-07T20:26:01.3041534Z 2025-05-07T20:26:01.3041539Z 2025-05-07T20:26:01.3041544Z 2025-05-07T20:26:01.3041860Z  2025-05-07T20:26:01.3042201Z 2025-05-07T20:26:01.3042206Z 2025-05-07T20:26:01.3042211Z 2025-05-07T20:26:01.3042216Z 2025-05-07T20:26:01.3042222Z 2025-05-07T20:26:01.3042234Z 2025-05-07T20:26:01.3042240Z 2025-05-07T20:26:01.3042244Z 2025-05-07T20:26:01.3042249Z 2025-05-07T20:26:01.3042268Z 2025-05-07T20:26:01.3042273Z 2025-05-07T20:26:01.3042278Z 2025-05-07T20:26:01.3042283Z 2025-05-07T20:26:01.3042288Z 2025-05-07T20:26:01.3042293Z 2025-05-07T20:26:01.3042299Z 2025-05-07T20:26:01.3042304Z 2025-05-07T20:26:01.3042308Z 2025-05-07T20:26:01.3042612Z  2025-05-07T20:26:01.3042958Z 2025-05-07T20:26:01.3042963Z 2025-05-07T20:26:01.3043103Z  2025-05-07T20:26:01.3043250Z 2025-05-07T20:26:01.3043255Z 2025-05-07T20:26:01.3043394Z  2025-05-07T20:26:01.3043541Z 2025-05-07T20:26:01.3043547Z 2025-05-07T20:26:01.3043552Z 2025-05-07T20:26:01.3043715Z  2025-05-07T20:26:01.3043863Z 2025-05-07T20:26:01.3043868Z 2025-05-07T20:26:01.3043873Z 2025-05-07T20:26:01.3043879Z 2025-05-07T20:26:01.3044029Z  2025-05-07T20:26:01.3044184Z 2025-05-07T20:26:01.3044189Z 2025-05-07T20:26:01.3044194Z 2025-05-07T20:26:01.3044199Z 2025-05-07T20:26:01.3044212Z 2025-05-07T20:26:01.3044371Z  2025-05-07T20:26:01.3044537Z 2025-05-07T20:26:01.3044542Z 2025-05-07T20:26:01.3044547Z 2025-05-07T20:26:01.3044552Z 2025-05-07T20:26:01.3044558Z 2025-05-07T20:26:01.3044563Z 2025-05-07T20:26:01.3044720Z  2025-05-07T20:26:01.3044893Z 
2025-05-07T20:26:01.3044898Z 2025-05-07T20:26:01.3044903Z 2025-05-07T20:26:01.3044908Z 2025-05-07T20:26:01.3044913Z 2025-05-07T20:26:01.3044918Z 2025-05-07T20:26:01.3044923Z 2025-05-07T20:26:01.3045093Z  2025-05-07T20:26:01.3045282Z 2025-05-07T20:26:01.3045287Z 2025-05-07T20:26:01.3045292Z 2025-05-07T20:26:01.3045297Z 2025-05-07T20:26:01.3045302Z 2025-05-07T20:26:01.3045308Z 2025-05-07T20:26:01.3045313Z 2025-05-07T20:26:01.3045318Z 2025-05-07T20:26:01.3045487Z  2025-05-07T20:26:01.3045692Z 2025-05-07T20:26:01.3045698Z 2025-05-07T20:26:01.3045703Z 2025-05-07T20:26:01.3045708Z 2025-05-07T20:26:01.3045713Z 2025-05-07T20:26:01.3045718Z 2025-05-07T20:26:01.3045729Z 2025-05-07T20:26:01.3045858Z 2025-05-07T20:26:01.3045864Z 2025-05-07T20:26:01.3046058Z  2025-05-07T20:26:01.3046278Z 2025-05-07T20:26:01.3046284Z 2025-05-07T20:26:01.3046289Z 2025-05-07T20:26:01.3046294Z 2025-05-07T20:26:01.3046299Z 2025-05-07T20:26:01.3046304Z 2025-05-07T20:26:01.3046309Z 2025-05-07T20:26:01.3046314Z 2025-05-07T20:26:01.3046319Z 2025-05-07T20:26:01.3046332Z 2025-05-07T20:26:01.3046515Z  2025-05-07T20:26:01.3046744Z 2025-05-07T20:26:01.3046749Z 2025-05-07T20:26:01.3046755Z 2025-05-07T20:26:01.3046759Z 2025-05-07T20:26:01.3046765Z 2025-05-07T20:26:01.3046770Z 2025-05-07T20:26:01.3046775Z 2025-05-07T20:26:01.3046787Z 2025-05-07T20:26:01.3046792Z 2025-05-07T20:26:01.3046797Z 2025-05-07T20:26:01.3046802Z 2025-05-07T20:26:01.3046990Z  2025-05-07T20:26:01.3047238Z 2025-05-07T20:26:01.3047250Z 2025-05-07T20:26:01.3047256Z 2025-05-07T20:26:01.3047261Z 2025-05-07T20:26:01.3047366Z 2025-05-07T20:26:01.3047377Z 2025-05-07T20:26:01.3047383Z 2025-05-07T20:26:01.3047388Z 2025-05-07T20:26:01.3047393Z 2025-05-07T20:26:01.3047398Z 2025-05-07T20:26:01.3047403Z 2025-05-07T20:26:01.3047408Z 2025-05-07T20:26:01.3047602Z  2025-05-07T20:26:01.3047865Z 2025-05-07T20:26:01.3047870Z 2025-05-07T20:26:01.3047875Z 2025-05-07T20:26:01.3047880Z 2025-05-07T20:26:01.3047885Z 2025-05-07T20:26:01.3047890Z 2025-05-07T20:26:01.3047895Z 2025-05-07T20:26:01.3047900Z 2025-05-07T20:26:01.3047905Z 2025-05-07T20:26:01.3047910Z 2025-05-07T20:26:01.3047915Z 2025-05-07T20:26:01.3047920Z 2025-05-07T20:26:01.3047925Z 2025-05-07T20:26:01.3048132Z  2025-05-07T20:26:01.3048393Z 2025-05-07T20:26:01.3048399Z 2025-05-07T20:26:01.3048404Z 2025-05-07T20:26:01.3048409Z 2025-05-07T20:26:01.3048414Z 2025-05-07T20:26:01.3048419Z 2025-05-07T20:26:01.3048424Z 2025-05-07T20:26:01.3048429Z 2025-05-07T20:26:01.3048435Z 2025-05-07T20:26:01.3048452Z 2025-05-07T20:26:01.3048458Z 2025-05-07T20:26:01.3048463Z 2025-05-07T20:26:01.3048478Z 2025-05-07T20:26:01.3048484Z 2025-05-07T20:26:01.3048679Z  2025-05-07T20:26:01.3048951Z 2025-05-07T20:26:01.3048956Z 2025-05-07T20:26:01.3048961Z 2025-05-07T20:26:01.3048966Z 2025-05-07T20:26:01.3048972Z 2025-05-07T20:26:01.3048986Z 2025-05-07T20:26:01.3048993Z 2025-05-07T20:26:01.3048999Z 2025-05-07T20:26:01.3049006Z 2025-05-07T20:26:01.3049012Z 2025-05-07T20:26:01.3049018Z 2025-05-07T20:26:01.3049025Z 2025-05-07T20:26:01.3049032Z 2025-05-07T20:26:01.3049038Z 2025-05-07T20:26:01.3049044Z 2025-05-07T20:26:01.3049294Z  2025-05-07T20:26:01.3049579Z 2025-05-07T20:26:01.3049584Z 2025-05-07T20:26:01.3049589Z 2025-05-07T20:26:01.3049595Z 2025-05-07T20:26:01.3049599Z 2025-05-07T20:26:01.3049605Z 2025-05-07T20:26:01.3049610Z 2025-05-07T20:26:01.3049615Z 2025-05-07T20:26:01.3049620Z 2025-05-07T20:26:01.3049631Z 2025-05-07T20:26:01.3049641Z 2025-05-07T20:26:01.3049646Z 2025-05-07T20:26:01.3049652Z 2025-05-07T20:26:01.3049657Z 2025-05-07T20:26:01.3049662Z 
2025-05-07T20:26:01.3049667Z 2025-05-07T20:26:01.3049882Z  2025-05-07T20:26:01.3050179Z 2025-05-07T20:26:01.3050185Z 2025-05-07T20:26:01.3050191Z 2025-05-07T20:26:01.3050196Z 2025-05-07T20:26:01.3050201Z 2025-05-07T20:26:01.3050206Z 2025-05-07T20:26:01.3050211Z 2025-05-07T20:26:01.3050216Z 2025-05-07T20:26:01.3050221Z 2025-05-07T20:26:01.3050237Z 2025-05-07T20:26:01.3050242Z 2025-05-07T20:26:01.3050247Z 2025-05-07T20:26:01.3050252Z 2025-05-07T20:26:01.3050257Z 2025-05-07T20:26:01.3050262Z 2025-05-07T20:26:01.3050267Z 2025-05-07T20:26:01.3050272Z 2025-05-07T20:26:01.3050512Z  2025-05-07T20:26:01.3050814Z 2025-05-07T20:26:01.3050819Z 2025-05-07T20:26:01.3050824Z 2025-05-07T20:26:01.3050830Z 2025-05-07T20:26:01.3050835Z 2025-05-07T20:26:01.3050961Z 2025-05-07T20:26:01.3050968Z 2025-05-07T20:26:01.3050973Z 2025-05-07T20:26:01.3050978Z 2025-05-07T20:26:01.3050983Z 2025-05-07T20:26:01.3050988Z 2025-05-07T20:26:01.3050993Z 2025-05-07T20:26:01.3050998Z 2025-05-07T20:26:01.3051003Z 2025-05-07T20:26:01.3051008Z 2025-05-07T20:26:01.3051013Z 2025-05-07T20:26:01.3051018Z 2025-05-07T20:26:01.3051023Z 2025-05-07T20:26:01.3051274Z  2025-05-07T20:26:01.3051580Z 2025-05-07T20:26:01.3051586Z 2025-05-07T20:26:01.3051727Z  2025-05-07T20:26:01.3051872Z 2025-05-07T20:26:01.3051877Z 2025-05-07T20:26:01.3052011Z  2025-05-07T20:26:01.3052158Z 2025-05-07T20:26:01.3052163Z 2025-05-07T20:26:01.3052168Z 2025-05-07T20:26:01.3052302Z  2025-05-07T20:26:01.3052451Z 2025-05-07T20:26:01.3052457Z 2025-05-07T20:26:01.3052469Z 2025-05-07T20:26:01.3052475Z 2025-05-07T20:26:01.3052621Z  2025-05-07T20:26:01.3052780Z 2025-05-07T20:26:01.3052785Z 2025-05-07T20:26:01.3052922Z 2025-05-07T20:26:01.3052934Z 2025-05-07T20:26:01.3052939Z 2025-05-07T20:26:01.3053106Z  2025-05-07T20:26:01.3053272Z 2025-05-07T20:26:01.3053277Z 2025-05-07T20:26:01.3053282Z 2025-05-07T20:26:01.3053287Z 2025-05-07T20:26:01.3053292Z 2025-05-07T20:26:01.3053297Z 2025-05-07T20:26:01.3053456Z  2025-05-07T20:26:01.3053629Z 2025-05-07T20:26:01.3053634Z 2025-05-07T20:26:01.3053639Z 2025-05-07T20:26:01.3053644Z 2025-05-07T20:26:01.3053649Z 2025-05-07T20:26:01.3053654Z 2025-05-07T20:26:01.3053659Z 2025-05-07T20:26:01.3053831Z  2025-05-07T20:26:01.3054017Z 2025-05-07T20:26:01.3054022Z 2025-05-07T20:26:01.3054027Z 2025-05-07T20:26:01.3054032Z 2025-05-07T20:26:01.3054037Z 2025-05-07T20:26:01.3054042Z 2025-05-07T20:26:01.3054047Z 2025-05-07T20:26:01.3054053Z 2025-05-07T20:26:01.3054226Z  2025-05-07T20:26:01.3054433Z 2025-05-07T20:26:01.3054439Z 2025-05-07T20:26:01.3054444Z 2025-05-07T20:26:01.3054459Z 2025-05-07T20:26:01.3054471Z 2025-05-07T20:26:01.3054476Z 2025-05-07T20:26:01.3054490Z 2025-05-07T20:26:01.3054495Z 2025-05-07T20:26:01.3054500Z 2025-05-07T20:26:01.3054679Z  2025-05-07T20:26:01.3054896Z 2025-05-07T20:26:01.3054901Z 2025-05-07T20:26:01.3054906Z 2025-05-07T20:26:01.3054918Z 2025-05-07T20:26:01.3054923Z 2025-05-07T20:26:01.3054928Z 2025-05-07T20:26:01.3054933Z 2025-05-07T20:26:01.3054938Z 2025-05-07T20:26:01.3054943Z 2025-05-07T20:26:01.3054948Z 2025-05-07T20:26:01.3055128Z  2025-05-07T20:26:01.3055357Z 2025-05-07T20:26:01.3055362Z 2025-05-07T20:26:01.3055367Z 2025-05-07T20:26:01.3055372Z 2025-05-07T20:26:01.3055377Z 2025-05-07T20:26:01.3055387Z 2025-05-07T20:26:01.3055392Z 2025-05-07T20:26:01.3055396Z 2025-05-07T20:26:01.3055402Z 2025-05-07T20:26:01.3055407Z 2025-05-07T20:26:01.3055412Z 2025-05-07T20:26:01.3055795Z  2025-05-07T20:26:01.3056057Z 2025-05-07T20:26:01.3056063Z 2025-05-07T20:26:01.3056087Z 2025-05-07T20:26:01.3056092Z 2025-05-07T20:26:01.3056098Z 
2025-05-07T20:26:01.3056103Z 2025-05-07T20:26:01.3056118Z 2025-05-07T20:26:01.3056124Z 2025-05-07T20:26:01.3056129Z 2025-05-07T20:26:01.3056143Z 2025-05-07T20:26:01.3056148Z 2025-05-07T20:26:01.3056153Z 2025-05-07T20:26:01.3056340Z  2025-05-07T20:26:01.3056612Z 2025-05-07T20:26:01.3056618Z 2025-05-07T20:26:01.3056623Z 2025-05-07T20:26:01.3056628Z 2025-05-07T20:26:01.3056633Z 2025-05-07T20:26:01.3056639Z 2025-05-07T20:26:01.3056644Z 2025-05-07T20:26:01.3056649Z 2025-05-07T20:26:01.3056654Z 2025-05-07T20:26:01.3056659Z 2025-05-07T20:26:01.3056664Z 2025-05-07T20:26:01.3056669Z 2025-05-07T20:26:01.3056674Z 2025-05-07T20:26:01.3056873Z  2025-05-07T20:26:01.3057146Z 2025-05-07T20:26:01.3057151Z 2025-05-07T20:26:01.3057156Z 2025-05-07T20:26:01.3057161Z 2025-05-07T20:26:01.3057167Z 2025-05-07T20:26:01.3057172Z 2025-05-07T20:26:01.3057184Z 2025-05-07T20:26:01.3057321Z 2025-05-07T20:26:01.3057329Z 2025-05-07T20:26:01.3057334Z 2025-05-07T20:26:01.3057339Z 2025-05-07T20:26:01.3057344Z 2025-05-07T20:26:01.3057349Z 2025-05-07T20:26:01.3057354Z 2025-05-07T20:26:01.3057578Z  2025-05-07T20:26:01.3057852Z 2025-05-07T20:26:01.3057858Z 2025-05-07T20:26:01.3057863Z 2025-05-07T20:26:01.3057868Z 2025-05-07T20:26:01.3057873Z 2025-05-07T20:26:01.3057878Z 2025-05-07T20:26:01.3057883Z 2025-05-07T20:26:01.3057897Z 2025-05-07T20:26:01.3057902Z 2025-05-07T20:26:01.3057907Z 2025-05-07T20:26:01.3057912Z 2025-05-07T20:26:01.3057917Z 2025-05-07T20:26:01.3057922Z 2025-05-07T20:26:01.3057926Z 2025-05-07T20:26:01.3057931Z 2025-05-07T20:26:01.3058137Z  2025-05-07T20:26:01.3058419Z 2025-05-07T20:26:01.3058425Z 2025-05-07T20:26:01.3058430Z 2025-05-07T20:26:01.3058435Z 2025-05-07T20:26:01.3058440Z 2025-05-07T20:26:01.3058458Z 2025-05-07T20:26:01.3058559Z 2025-05-07T20:26:01.3058571Z 2025-05-07T20:26:01.3058577Z 2025-05-07T20:26:01.3058582Z 2025-05-07T20:26:01.3058587Z 2025-05-07T20:26:01.3058592Z 2025-05-07T20:26:01.3058597Z 2025-05-07T20:26:01.3058602Z 2025-05-07T20:26:01.3058607Z 2025-05-07T20:26:01.3058612Z 2025-05-07T20:26:01.3058836Z  2025-05-07T20:26:01.3059125Z 2025-05-07T20:26:01.3059130Z 2025-05-07T20:26:01.3059135Z 2025-05-07T20:26:01.3059141Z 2025-05-07T20:26:01.3059146Z 2025-05-07T20:26:01.3059151Z 2025-05-07T20:26:01.3059157Z 2025-05-07T20:26:01.3059162Z 2025-05-07T20:26:01.3059167Z 2025-05-07T20:26:01.3059173Z 2025-05-07T20:26:01.3059186Z 2025-05-07T20:26:01.3059191Z 2025-05-07T20:26:01.3059197Z 2025-05-07T20:26:01.3059202Z 2025-05-07T20:26:01.3059207Z 2025-05-07T20:26:01.3059212Z 2025-05-07T20:26:01.3059217Z 2025-05-07T20:26:01.3059442Z  2025-05-07T20:26:01.3059747Z 2025-05-07T20:26:01.3059752Z 2025-05-07T20:26:01.3059764Z 2025-05-07T20:26:01.3059776Z 2025-05-07T20:26:01.3059781Z 2025-05-07T20:26:01.3059786Z 2025-05-07T20:26:01.3059791Z 2025-05-07T20:26:01.3059796Z 2025-05-07T20:26:01.3059801Z 2025-05-07T20:26:01.3059806Z 2025-05-07T20:26:01.3059811Z 2025-05-07T20:26:01.3059816Z 2025-05-07T20:26:01.3059821Z 2025-05-07T20:26:01.3059826Z 2025-05-07T20:26:01.3059831Z 2025-05-07T20:26:01.3059836Z 2025-05-07T20:26:01.3059841Z 2025-05-07T20:26:01.3059846Z 2025-05-07T20:26:01.3060087Z  2025-05-07T20:26:01.3060384Z 2025-05-07T20:26:01.3060389Z 2025-05-07T20:26:01.3060522Z  2025-05-07T20:26:01.3060666Z 2025-05-07T20:26:01.3060672Z 2025-05-07T20:26:01.3060812Z  2025-05-07T20:26:01.3060962Z 2025-05-07T20:26:01.3060968Z 2025-05-07T20:26:01.3060973Z 2025-05-07T20:26:01.3061114Z  2025-05-07T20:26:01.3061254Z 2025-05-07T20:26:01.3061260Z 2025-05-07T20:26:01.3061265Z 2025-05-07T20:26:01.3061277Z 2025-05-07T20:26:01.3061424Z  
2025-05-07T20:26:01.3061590Z 2025-05-07T20:26:01.3061595Z 2025-05-07T20:26:01.3061601Z 2025-05-07T20:26:01.3061606Z 2025-05-07T20:26:01.3061611Z 2025-05-07T20:26:01.3061764Z  2025-05-07T20:26:01.3061927Z 2025-05-07T20:26:01.3061932Z 2025-05-07T20:26:01.3061938Z 2025-05-07T20:26:01.3061943Z 2025-05-07T20:26:01.3061948Z 2025-05-07T20:26:01.3061959Z 2025-05-07T20:26:01.3062117Z  2025-05-07T20:26:01.3062287Z 2025-05-07T20:26:01.3062292Z 2025-05-07T20:26:01.3062297Z 2025-05-07T20:26:01.3062302Z 2025-05-07T20:26:01.3062307Z 2025-05-07T20:26:01.3062312Z 2025-05-07T20:26:01.3062317Z 2025-05-07T20:26:01.3062481Z  2025-05-07T20:26:01.3062665Z 2025-05-07T20:26:01.3062670Z 2025-05-07T20:26:01.3062675Z 2025-05-07T20:26:01.3062680Z 2025-05-07T20:26:01.3062685Z 2025-05-07T20:26:01.3062690Z 2025-05-07T20:26:01.3062695Z 2025-05-07T20:26:01.3062700Z 2025-05-07T20:26:01.3062871Z  2025-05-07T20:26:01.3063092Z 2025-05-07T20:26:01.3063103Z 2025-05-07T20:26:01.3063213Z 2025-05-07T20:26:01.3063219Z 2025-05-07T20:26:01.3063224Z 2025-05-07T20:26:01.3063229Z 2025-05-07T20:26:01.3063242Z 2025-05-07T20:26:01.3063247Z 2025-05-07T20:26:01.3063252Z 2025-05-07T20:26:01.3063438Z  2025-05-07T20:26:01.3063652Z 2025-05-07T20:26:01.3063657Z 2025-05-07T20:26:01.3063662Z 2025-05-07T20:26:01.3063675Z 2025-05-07T20:26:01.3063680Z 2025-05-07T20:26:01.3063686Z 2025-05-07T20:26:01.3063691Z 2025-05-07T20:26:01.3063696Z 2025-05-07T20:26:01.3063700Z 2025-05-07T20:26:01.3063705Z 2025-05-07T20:26:01.3063890Z  2025-05-07T20:26:01.3064125Z 2025-05-07T20:26:01.3064130Z 2025-05-07T20:26:01.3064135Z 2025-05-07T20:26:01.3064140Z 2025-05-07T20:26:01.3064145Z 2025-05-07T20:26:01.3064150Z 2025-05-07T20:26:01.3064155Z 2025-05-07T20:26:01.3064160Z 2025-05-07T20:26:01.3064165Z 2025-05-07T20:26:01.3064170Z 2025-05-07T20:26:01.3064175Z 2025-05-07T20:26:01.3064354Z  2025-05-07T20:26:01.3064709Z 2025-05-07T20:26:01.3064714Z 2025-05-07T20:26:01.3064719Z 2025-05-07T20:26:01.3064724Z 2025-05-07T20:26:01.3064730Z 2025-05-07T20:26:01.3064735Z 2025-05-07T20:26:01.3064740Z 2025-05-07T20:26:01.3064745Z 2025-05-07T20:26:01.3064750Z 2025-05-07T20:26:01.3064755Z 2025-05-07T20:26:01.3064760Z 2025-05-07T20:26:01.3064765Z 2025-05-07T20:26:01.3064966Z  2025-05-07T20:26:01.3065222Z 2025-05-07T20:26:01.3065228Z 2025-05-07T20:26:01.3065233Z 2025-05-07T20:26:01.3065238Z 2025-05-07T20:26:01.3065243Z 2025-05-07T20:26:01.3065248Z 2025-05-07T20:26:01.3065254Z 2025-05-07T20:26:01.3065259Z 2025-05-07T20:26:01.3065264Z 2025-05-07T20:26:01.3065270Z 2025-05-07T20:26:01.3065275Z 2025-05-07T20:26:01.3065287Z 2025-05-07T20:26:01.3065292Z 2025-05-07T20:26:01.3065487Z  2025-05-07T20:26:01.3065753Z 2025-05-07T20:26:01.3065758Z 2025-05-07T20:26:01.3065763Z 2025-05-07T20:26:01.3065768Z 2025-05-07T20:26:01.3065780Z 2025-05-07T20:26:01.3065791Z 2025-05-07T20:26:01.3065804Z 2025-05-07T20:26:01.3065810Z 2025-05-07T20:26:01.3065814Z 2025-05-07T20:26:01.3065820Z 2025-05-07T20:26:01.3065825Z 2025-05-07T20:26:01.3065830Z 2025-05-07T20:26:01.3065835Z 2025-05-07T20:26:01.3065840Z 2025-05-07T20:26:01.3066043Z  2025-05-07T20:26:01.3066319Z 2025-05-07T20:26:01.3066325Z 2025-05-07T20:26:01.3066330Z 2025-05-07T20:26:01.3066335Z 2025-05-07T20:26:01.3066340Z 2025-05-07T20:26:01.3066345Z 2025-05-07T20:26:01.3066350Z 2025-05-07T20:26:01.3066355Z 2025-05-07T20:26:01.3066360Z 2025-05-07T20:26:01.3066365Z 2025-05-07T20:26:01.3066370Z 2025-05-07T20:26:01.3066375Z 2025-05-07T20:26:01.3066380Z 2025-05-07T20:26:01.3066385Z 2025-05-07T20:26:01.3066390Z 2025-05-07T20:26:01.3066599Z  2025-05-07T20:26:01.3066876Z 
2025-05-07T20:26:01.3066881Z 2025-05-07T20:26:01.3066886Z 2025-05-07T20:26:01.3066891Z 2025-05-07T20:26:01.3066902Z 2025-05-07T20:26:01.3066913Z 2025-05-07T20:26:01.3066918Z 2025-05-07T20:26:01.3066923Z 2025-05-07T20:26:01.3066928Z 2025-05-07T20:26:01.3066933Z 2025-05-07T20:26:01.3066938Z 2025-05-07T20:26:01.3066950Z 2025-05-07T20:26:01.3066955Z 2025-05-07T20:26:01.3066960Z 2025-05-07T20:26:01.3066965Z 2025-05-07T20:26:01.3066970Z 2025-05-07T20:26:01.3067185Z  2025-05-07T20:26:01.3067488Z 2025-05-07T20:26:01.3067494Z 2025-05-07T20:26:01.3067499Z 2025-05-07T20:26:01.3067505Z 2025-05-07T20:26:01.3067510Z 2025-05-07T20:26:01.3067515Z 2025-05-07T20:26:01.3067520Z 2025-05-07T20:26:01.3067525Z 2025-05-07T20:26:01.3067530Z 2025-05-07T20:26:01.3067535Z 2025-05-07T20:26:01.3067540Z 2025-05-07T20:26:01.3067545Z 2025-05-07T20:26:01.3067550Z 2025-05-07T20:26:01.3067555Z 2025-05-07T20:26:01.3067560Z 2025-05-07T20:26:01.3067565Z 2025-05-07T20:26:01.3067571Z 2025-05-07T20:26:01.3067819Z  2025-05-07T20:26:01.3068124Z 2025-05-07T20:26:01.3068238Z 2025-05-07T20:26:01.3068244Z 2025-05-07T20:26:01.3068249Z 2025-05-07T20:26:01.3068254Z 2025-05-07T20:26:01.3068259Z 2025-05-07T20:26:01.3068265Z 2025-05-07T20:26:01.3068270Z 2025-05-07T20:26:01.3068275Z 2025-05-07T20:26:01.3068280Z 2025-05-07T20:26:01.3068285Z 2025-05-07T20:26:01.3068290Z 2025-05-07T20:26:01.3068304Z 2025-05-07T20:26:01.3068309Z 2025-05-07T20:26:01.3068314Z 2025-05-07T20:26:01.3068319Z 2025-05-07T20:26:01.3068324Z 2025-05-07T20:26:01.3068329Z 2025-05-07T20:26:01.3068583Z  2025-05-07T20:26:01.3068889Z 2025-05-07T20:26:01.3068895Z 2025-05-07T20:26:01.3069030Z  2025-05-07T20:26:01.3069167Z 2025-05-07T20:26:01.3069172Z 2025-05-07T20:26:01.3069317Z  2025-05-07T20:26:01.3069465Z 2025-05-07T20:26:01.3069470Z 2025-05-07T20:26:01.3069475Z 2025-05-07T20:26:01.3069635Z  2025-05-07T20:26:01.3069994Z 2025-05-07T20:26:01.3069999Z 2025-05-07T20:26:01.3070004Z 2025-05-07T20:26:01.3070116Z 2025-05-07T20:26:01.3070276Z  2025-05-07T20:26:01.3070439Z 2025-05-07T20:26:01.3070444Z 2025-05-07T20:26:01.3070449Z 2025-05-07T20:26:01.3070455Z 2025-05-07T20:26:01.3070460Z 2025-05-07T20:26:01.3070614Z  2025-05-07T20:26:01.3070788Z 2025-05-07T20:26:01.3070793Z 2025-05-07T20:26:01.3070799Z 2025-05-07T20:26:01.3070804Z 2025-05-07T20:26:01.3070809Z 2025-05-07T20:26:01.3070814Z 2025-05-07T20:26:01.3070974Z  2025-05-07T20:26:01.3071151Z 2025-05-07T20:26:01.3071156Z 2025-05-07T20:26:01.3071161Z 2025-05-07T20:26:01.3071167Z 2025-05-07T20:26:01.3071171Z 2025-05-07T20:26:01.3071177Z 2025-05-07T20:26:01.3071182Z 2025-05-07T20:26:01.3071334Z  2025-05-07T20:26:01.3071535Z 2025-05-07T20:26:01.3071540Z 2025-05-07T20:26:01.3071545Z 2025-05-07T20:26:01.3071551Z 2025-05-07T20:26:01.3071556Z 2025-05-07T20:26:01.3071561Z 2025-05-07T20:26:01.3071566Z 2025-05-07T20:26:01.3071571Z 2025-05-07T20:26:01.3071742Z  2025-05-07T20:26:01.3071966Z 2025-05-07T20:26:01.3071971Z 2025-05-07T20:26:01.3071976Z 2025-05-07T20:26:01.3071981Z 2025-05-07T20:26:01.3071986Z 2025-05-07T20:26:01.3071991Z 2025-05-07T20:26:01.3071995Z 2025-05-07T20:26:01.3072001Z 2025-05-07T20:26:01.3072005Z 2025-05-07T20:26:01.3072171Z  2025-05-07T20:26:01.3072389Z 2025-05-07T20:26:01.3072394Z 2025-05-07T20:26:01.3072399Z 2025-05-07T20:26:01.3072404Z 2025-05-07T20:26:01.3072409Z 2025-05-07T20:26:01.3072414Z 2025-05-07T20:26:01.3072420Z 2025-05-07T20:26:01.3072425Z 2025-05-07T20:26:01.3072430Z 2025-05-07T20:26:01.3072435Z 2025-05-07T20:26:01.3072633Z  2025-05-07T20:26:01.3072860Z 2025-05-07T20:26:01.3072865Z 2025-05-07T20:26:01.3072870Z 
2025-05-07T20:26:01.3072875Z 2025-05-07T20:26:01.3072879Z 2025-05-07T20:26:01.3072885Z 2025-05-07T20:26:01.3072890Z 2025-05-07T20:26:01.3072895Z 2025-05-07T20:26:01.3072900Z 2025-05-07T20:26:01.3072905Z 2025-05-07T20:26:01.3072916Z 2025-05-07T20:26:01.3073108Z  2025-05-07T20:26:01.3073354Z 2025-05-07T20:26:01.3073360Z 2025-05-07T20:26:01.3073365Z 2025-05-07T20:26:01.3073370Z 2025-05-07T20:26:01.3073376Z 2025-05-07T20:26:01.3073381Z 2025-05-07T20:26:01.3073386Z 2025-05-07T20:26:01.3073391Z 2025-05-07T20:26:01.3073404Z 2025-05-07T20:26:01.3073409Z 2025-05-07T20:26:01.3073414Z 2025-05-07T20:26:01.3073428Z 2025-05-07T20:26:01.3073612Z  2025-05-07T20:26:01.3073865Z 2025-05-07T20:26:01.3073878Z 2025-05-07T20:26:01.3073884Z 2025-05-07T20:26:01.3073889Z 2025-05-07T20:26:01.3073894Z 2025-05-07T20:26:01.3073899Z 2025-05-07T20:26:01.3073904Z 2025-05-07T20:26:01.3073909Z 2025-05-07T20:26:01.3073914Z 2025-05-07T20:26:01.3073918Z 2025-05-07T20:26:01.3073923Z 2025-05-07T20:26:01.3073928Z 2025-05-07T20:26:01.3073933Z 2025-05-07T20:26:01.3074116Z  2025-05-07T20:26:01.3074389Z 2025-05-07T20:26:01.3074394Z 2025-05-07T20:26:01.3074526Z 2025-05-07T20:26:01.3074532Z 2025-05-07T20:26:01.3074538Z 2025-05-07T20:26:01.3074543Z 2025-05-07T20:26:01.3074548Z 2025-05-07T20:26:01.3074553Z 2025-05-07T20:26:01.3074558Z 2025-05-07T20:26:01.3074563Z 2025-05-07T20:26:01.3074568Z 2025-05-07T20:26:01.3074573Z 2025-05-07T20:26:01.3074578Z 2025-05-07T20:26:01.3074583Z 2025-05-07T20:26:01.3074812Z  2025-05-07T20:26:01.3075084Z 2025-05-07T20:26:01.3075090Z 2025-05-07T20:26:01.3075095Z 2025-05-07T20:26:01.3075100Z 2025-05-07T20:26:01.3075105Z 2025-05-07T20:26:01.3075110Z 2025-05-07T20:26:01.3075115Z 2025-05-07T20:26:01.3075120Z 2025-05-07T20:26:01.3075126Z 2025-05-07T20:26:01.3075139Z 2025-05-07T20:26:01.3075145Z 2025-05-07T20:26:01.3075150Z 2025-05-07T20:26:01.3075155Z 2025-05-07T20:26:01.3075160Z 2025-05-07T20:26:01.3075165Z 2025-05-07T20:26:01.3075364Z  2025-05-07T20:26:01.3075641Z 2025-05-07T20:26:01.3075654Z 2025-05-07T20:26:01.3075783Z 2025-05-07T20:26:01.3075794Z 2025-05-07T20:26:01.3075799Z 2025-05-07T20:26:01.3075804Z 2025-05-07T20:26:01.3075809Z 2025-05-07T20:26:01.3075814Z 2025-05-07T20:26:01.3075819Z 2025-05-07T20:26:01.3075824Z 2025-05-07T20:26:01.3075829Z 2025-05-07T20:26:01.3075834Z 2025-05-07T20:26:01.3075839Z 2025-05-07T20:26:01.3075844Z 2025-05-07T20:26:01.3075849Z 2025-05-07T20:26:01.3075854Z 2025-05-07T20:26:01.3076086Z  2025-05-07T20:26:01.3076377Z 2025-05-07T20:26:01.3076383Z 2025-05-07T20:26:01.3076388Z 2025-05-07T20:26:01.3076393Z 2025-05-07T20:26:01.3076398Z 2025-05-07T20:26:01.3076404Z 2025-05-07T20:26:01.3076409Z 2025-05-07T20:26:01.3076415Z 2025-05-07T20:26:01.3076420Z 2025-05-07T20:26:01.3076425Z 2025-05-07T20:26:01.3076430Z 2025-05-07T20:26:01.3076435Z 2025-05-07T20:26:01.3076447Z 2025-05-07T20:26:01.3076452Z 2025-05-07T20:26:01.3076457Z 2025-05-07T20:26:01.3076462Z 2025-05-07T20:26:01.3076467Z 2025-05-07T20:26:01.3076703Z  2025-05-07T20:26:01.3076991Z 2025-05-07T20:26:01.3076996Z 2025-05-07T20:26:01.3077009Z 2025-05-07T20:26:01.3077015Z 2025-05-07T20:26:01.3077020Z 2025-05-07T20:26:01.3077025Z 2025-05-07T20:26:01.3077030Z 2025-05-07T20:26:01.3077035Z 2025-05-07T20:26:01.3077040Z 2025-05-07T20:26:01.3077045Z 2025-05-07T20:26:01.3077050Z 2025-05-07T20:26:01.3077055Z 2025-05-07T20:26:01.3077060Z 2025-05-07T20:26:01.3077065Z 2025-05-07T20:26:01.3077070Z 2025-05-07T20:26:01.3077075Z 2025-05-07T20:26:01.3077080Z 2025-05-07T20:26:01.3077085Z 2025-05-07T20:26:01.3077325Z  2025-05-07T20:26:01.3077620Z 
2025-05-07T20:26:01.3077625Z 2025-05-07T20:26:01.3077755Z  2025-05-07T20:26:01.3077897Z 2025-05-07T20:26:01.3077903Z 2025-05-07T20:26:01.3078034Z  2025-05-07T20:26:01.3078173Z 2025-05-07T20:26:01.3078178Z 2025-05-07T20:26:01.3078194Z 2025-05-07T20:26:01.3078339Z  2025-05-07T20:26:01.3078488Z 2025-05-07T20:26:01.3078500Z 2025-05-07T20:26:01.3078505Z 2025-05-07T20:26:01.3078509Z 2025-05-07T20:26:01.3078665Z  2025-05-07T20:26:01.3078824Z 2025-05-07T20:26:01.3078830Z 2025-05-07T20:26:01.3078835Z 2025-05-07T20:26:01.3078840Z 2025-05-07T20:26:01.3078845Z 2025-05-07T20:26:01.3079002Z  2025-05-07T20:26:01.3079172Z 2025-05-07T20:26:01.3079178Z 2025-05-07T20:26:01.3079183Z 2025-05-07T20:26:01.3079188Z 2025-05-07T20:26:01.3079193Z 2025-05-07T20:26:01.3079198Z 2025-05-07T20:26:01.3079369Z  2025-05-07T20:26:01.3079540Z 2025-05-07T20:26:01.3079546Z 2025-05-07T20:26:01.3079551Z 2025-05-07T20:26:01.3079556Z 2025-05-07T20:26:01.3079561Z 2025-05-07T20:26:01.3079566Z 2025-05-07T20:26:01.3079571Z 2025-05-07T20:26:01.3079732Z  2025-05-07T20:26:01.3079924Z 2025-05-07T20:26:01.3079929Z 2025-05-07T20:26:01.3079934Z 2025-05-07T20:26:01.3079939Z 2025-05-07T20:26:01.3079944Z 2025-05-07T20:26:01.3079949Z 2025-05-07T20:26:01.3079959Z 2025-05-07T20:26:01.3080075Z 2025-05-07T20:26:01.3080263Z  2025-05-07T20:26:01.3080464Z 2025-05-07T20:26:01.3080469Z 2025-05-07T20:26:01.3080474Z 2025-05-07T20:26:01.3080479Z 2025-05-07T20:26:01.3080484Z 2025-05-07T20:26:01.3080489Z 2025-05-07T20:26:01.3080494Z 2025-05-07T20:26:01.3080499Z 2025-05-07T20:26:01.3080513Z 2025-05-07T20:26:01.3080676Z  2025-05-07T20:26:01.3080886Z 2025-05-07T20:26:01.3080892Z 2025-05-07T20:26:01.3080897Z 2025-05-07T20:26:01.3080902Z 2025-05-07T20:26:01.3080907Z 2025-05-07T20:26:01.3080912Z 2025-05-07T20:26:01.3080924Z 2025-05-07T20:26:01.3080929Z 2025-05-07T20:26:01.3080934Z 2025-05-07T20:26:01.3080939Z 2025-05-07T20:26:01.3081123Z  2025-05-07T20:26:01.3081344Z 2025-05-07T20:26:01.3081349Z 2025-05-07T20:26:01.3081364Z 2025-05-07T20:26:01.3081369Z 2025-05-07T20:26:01.3081374Z 2025-05-07T20:26:01.3081379Z 2025-05-07T20:26:01.3081384Z 2025-05-07T20:26:01.3081389Z 2025-05-07T20:26:01.3081489Z 2025-05-07T20:26:01.3081504Z 2025-05-07T20:26:01.3081509Z 2025-05-07T20:26:01.3081689Z  2025-05-07T20:26:01.3081936Z 2025-05-07T20:26:01.3081942Z 2025-05-07T20:26:01.3081947Z 2025-05-07T20:26:01.3081952Z 2025-05-07T20:26:01.3081957Z 2025-05-07T20:26:01.3081963Z 2025-05-07T20:26:01.3081968Z 2025-05-07T20:26:01.3081973Z 2025-05-07T20:26:01.3081978Z 2025-05-07T20:26:01.3081984Z 2025-05-07T20:26:01.3081989Z 2025-05-07T20:26:01.3081994Z 2025-05-07T20:26:01.3082197Z  2025-05-07T20:26:01.3082619Z 2025-05-07T20:26:01.3082625Z 2025-05-07T20:26:01.3082630Z 2025-05-07T20:26:01.3082635Z 2025-05-07T20:26:01.3082640Z 2025-05-07T20:26:01.3082646Z 2025-05-07T20:26:01.3082650Z 2025-05-07T20:26:01.3082655Z 2025-05-07T20:26:01.3082661Z 2025-05-07T20:26:01.3082666Z 2025-05-07T20:26:01.3082671Z 2025-05-07T20:26:01.3082676Z 2025-05-07T20:26:01.3082681Z 2025-05-07T20:26:01.3083103Z  2025-05-07T20:26:01.3083394Z 2025-05-07T20:26:01.3083399Z 2025-05-07T20:26:01.3083404Z 2025-05-07T20:26:01.3083409Z 2025-05-07T20:26:01.3083414Z 2025-05-07T20:26:01.3083420Z 2025-05-07T20:26:01.3083425Z 2025-05-07T20:26:01.3083430Z 2025-05-07T20:26:01.3083444Z 2025-05-07T20:26:01.3083449Z 2025-05-07T20:26:01.3083454Z 2025-05-07T20:26:01.3083459Z 2025-05-07T20:26:01.3083464Z 2025-05-07T20:26:01.3083469Z 2025-05-07T20:26:01.3083675Z  2025-05-07T20:26:01.3083945Z 2025-05-07T20:26:01.3083962Z 2025-05-07T20:26:01.3083968Z 
2025-05-07T20:26:01.3083973Z 2025-05-07T20:26:01.3083978Z 2025-05-07T20:26:01.3083983Z 2025-05-07T20:26:01.3083988Z 2025-05-07T20:26:01.3083993Z 2025-05-07T20:26:01.3083998Z 2025-05-07T20:26:01.3084003Z 2025-05-07T20:26:01.3084008Z 2025-05-07T20:26:01.3084013Z 2025-05-07T20:26:01.3084018Z 2025-05-07T20:26:01.3084023Z 2025-05-07T20:26:01.3084028Z 2025-05-07T20:26:01.3084248Z  2025-05-07T20:26:01.3084537Z 2025-05-07T20:26:01.3084543Z 2025-05-07T20:26:01.3084548Z 2025-05-07T20:26:01.3084553Z 2025-05-07T20:26:01.3084558Z 2025-05-07T20:26:01.3084563Z 2025-05-07T20:26:01.3084568Z 2025-05-07T20:26:01.3084573Z 2025-05-07T20:26:01.3084578Z 2025-05-07T20:26:01.3084583Z 2025-05-07T20:26:01.3084588Z 2025-05-07T20:26:01.3084593Z 2025-05-07T20:26:01.3084723Z 2025-05-07T20:26:01.3084736Z 2025-05-07T20:26:01.3084741Z 2025-05-07T20:26:01.3084747Z 2025-05-07T20:26:01.3084954Z  2025-05-07T20:26:01.3085239Z 2025-05-07T20:26:01.3085244Z 2025-05-07T20:26:01.3085249Z 2025-05-07T20:26:01.3085255Z 2025-05-07T20:26:01.3085266Z 2025-05-07T20:26:01.3085272Z 2025-05-07T20:26:01.3085277Z 2025-05-07T20:26:01.3085282Z 2025-05-07T20:26:01.3085287Z 2025-05-07T20:26:01.3085292Z 2025-05-07T20:26:01.3085297Z 2025-05-07T20:26:01.3085302Z 2025-05-07T20:26:01.3085307Z 2025-05-07T20:26:01.3085312Z 2025-05-07T20:26:01.3085317Z 2025-05-07T20:26:01.3085328Z 2025-05-07T20:26:01.3085501Z 2025-05-07T20:26:01.3085723Z  2025-05-07T20:26:01.3086022Z 2025-05-07T20:26:01.3086027Z 2025-05-07T20:26:01.3086032Z 2025-05-07T20:26:01.3086037Z 2025-05-07T20:26:01.3086042Z 2025-05-07T20:26:01.3086047Z 2025-05-07T20:26:01.3086052Z 2025-05-07T20:26:01.3086057Z 2025-05-07T20:26:01.3086062Z 2025-05-07T20:26:01.3086067Z 2025-05-07T20:26:01.3086072Z 2025-05-07T20:26:01.3086077Z 2025-05-07T20:26:01.3086082Z 2025-05-07T20:26:01.3086087Z 2025-05-07T20:26:01.3086092Z 2025-05-07T20:26:01.3086097Z 2025-05-07T20:26:01.3086102Z 2025-05-07T20:26:01.3086107Z 2025-05-07T20:26:01.3086344Z  2025-05-07T20:26:01.3086643Z 2025-05-07T20:26:01.3086649Z 2025-05-07T20:26:01.3086792Z  2025-05-07T20:26:01.3086929Z 2025-05-07T20:26:01.3086935Z 2025-05-07T20:26:01.3087069Z  2025-05-07T20:26:01.3087217Z 2025-05-07T20:26:01.3087222Z 2025-05-07T20:26:01.3087363Z 2025-05-07T20:26:01.3087523Z  2025-05-07T20:26:01.3087679Z 2025-05-07T20:26:01.3087684Z 2025-05-07T20:26:01.3087689Z 2025-05-07T20:26:01.3087693Z 2025-05-07T20:26:01.3087835Z  2025-05-07T20:26:01.3087989Z 2025-05-07T20:26:01.3087994Z 2025-05-07T20:26:01.3088007Z 2025-05-07T20:26:01.3088013Z 2025-05-07T20:26:01.3088018Z 2025-05-07T20:26:01.3088176Z  2025-05-07T20:26:01.3088343Z 2025-05-07T20:26:01.3088355Z 2025-05-07T20:26:01.3088361Z 2025-05-07T20:26:01.3088366Z 2025-05-07T20:26:01.3088371Z 2025-05-07T20:26:01.3088376Z 2025-05-07T20:26:01.3088532Z  2025-05-07T20:26:01.3088707Z 2025-05-07T20:26:01.3088712Z 2025-05-07T20:26:01.3088726Z 2025-05-07T20:26:01.3088731Z 2025-05-07T20:26:01.3088736Z 2025-05-07T20:26:01.3088742Z 2025-05-07T20:26:01.3088747Z 2025-05-07T20:26:01.3088904Z  2025-05-07T20:26:01.3089090Z 2025-05-07T20:26:01.3089095Z 2025-05-07T20:26:01.3089101Z 2025-05-07T20:26:01.3089113Z 2025-05-07T20:26:01.3089127Z 2025-05-07T20:26:01.3089136Z 2025-05-07T20:26:01.3089141Z 2025-05-07T20:26:01.3089146Z 2025-05-07T20:26:01.3089305Z  2025-05-07T20:26:01.3089507Z 2025-05-07T20:26:01.3089513Z 2025-05-07T20:26:01.3089518Z 2025-05-07T20:26:01.3089530Z 2025-05-07T20:26:01.3089535Z 2025-05-07T20:26:01.3089540Z 2025-05-07T20:26:01.3089546Z 2025-05-07T20:26:01.3089551Z 2025-05-07T20:26:01.3089556Z 2025-05-07T20:26:01.3089720Z  
2025-05-07T20:26:01.3089938Z 2025-05-07T20:26:01.3089950Z 2025-05-07T20:26:01.3089955Z 2025-05-07T20:26:01.3089960Z 2025-05-07T20:26:01.3089965Z 2025-05-07T20:26:01.3089970Z 2025-05-07T20:26:01.3089976Z 2025-05-07T20:26:01.3089981Z 2025-05-07T20:26:01.3089986Z 2025-05-07T20:26:01.3089991Z 2025-05-07T20:26:01.3090164Z  2025-05-07T20:26:01.3090397Z 2025-05-07T20:26:01.3090402Z 2025-05-07T20:26:01.3090408Z 2025-05-07T20:26:01.3090413Z 2025-05-07T20:26:01.3090418Z 2025-05-07T20:26:01.3090455Z 2025-05-07T20:26:01.3090465Z 2025-05-07T20:26:01.3090470Z 2025-05-07T20:26:01.3090475Z 2025-05-07T20:26:01.3090479Z 2025-05-07T20:26:01.3090484Z 2025-05-07T20:26:01.3090667Z  2025-05-07T20:26:01.3090906Z 2025-05-07T20:26:01.3090911Z 2025-05-07T20:26:01.3090916Z 2025-05-07T20:26:01.3090921Z 2025-05-07T20:26:01.3090926Z 2025-05-07T20:26:01.3090931Z 2025-05-07T20:26:01.3090936Z 2025-05-07T20:26:01.3090941Z 2025-05-07T20:26:01.3090946Z 2025-05-07T20:26:01.3090952Z 2025-05-07T20:26:01.3090957Z 2025-05-07T20:26:01.3090962Z 2025-05-07T20:26:01.3091144Z  2025-05-07T20:26:01.3091394Z 2025-05-07T20:26:01.3091399Z 2025-05-07T20:26:01.3091404Z 2025-05-07T20:26:01.3091410Z 2025-05-07T20:26:01.3091415Z 2025-05-07T20:26:01.3091420Z 2025-05-07T20:26:01.3091425Z 2025-05-07T20:26:01.3091437Z 2025-05-07T20:26:01.3091443Z 2025-05-07T20:26:01.3091448Z 2025-05-07T20:26:01.3091453Z 2025-05-07T20:26:01.3091458Z 2025-05-07T20:26:01.3091469Z 2025-05-07T20:26:01.3091751Z  2025-05-07T20:26:01.3092018Z 2025-05-07T20:26:01.3092023Z 2025-05-07T20:26:01.3092036Z 2025-05-07T20:26:01.3092041Z 2025-05-07T20:26:01.3092046Z 2025-05-07T20:26:01.3092051Z 2025-05-07T20:26:01.3092056Z 2025-05-07T20:26:01.3092061Z 2025-05-07T20:26:01.3092066Z 2025-05-07T20:26:01.3092071Z 2025-05-07T20:26:01.3092076Z 2025-05-07T20:26:01.3092081Z 2025-05-07T20:26:01.3092086Z 2025-05-07T20:26:01.3092091Z 2025-05-07T20:26:01.3092283Z  2025-05-07T20:26:01.3092559Z 2025-05-07T20:26:01.3092564Z 2025-05-07T20:26:01.3092570Z 2025-05-07T20:26:01.3092575Z 2025-05-07T20:26:01.3092580Z 2025-05-07T20:26:01.3092585Z 2025-05-07T20:26:01.3092590Z 2025-05-07T20:26:01.3092595Z 2025-05-07T20:26:01.3092600Z 2025-05-07T20:26:01.3092605Z 2025-05-07T20:26:01.3092610Z 2025-05-07T20:26:01.3092615Z 2025-05-07T20:26:01.3092621Z 2025-05-07T20:26:01.3092626Z 2025-05-07T20:26:01.3092631Z 2025-05-07T20:26:01.3092927Z  2025-05-07T20:26:01.3093203Z 2025-05-07T20:26:01.3093208Z 2025-05-07T20:26:01.3093213Z 2025-05-07T20:26:01.3093218Z 2025-05-07T20:26:01.3093223Z 2025-05-07T20:26:01.3093228Z 2025-05-07T20:26:01.3093233Z 2025-05-07T20:26:01.3093238Z 2025-05-07T20:26:01.3093243Z 2025-05-07T20:26:01.3093247Z 2025-05-07T20:26:01.3093252Z 2025-05-07T20:26:01.3093264Z 2025-05-07T20:26:01.3093269Z 2025-05-07T20:26:01.3093274Z 2025-05-07T20:26:01.3093279Z 2025-05-07T20:26:01.3093284Z 2025-05-07T20:26:01.3093490Z  2025-05-07T20:26:01.3093781Z 2025-05-07T20:26:01.3093786Z 2025-05-07T20:26:01.3093800Z 2025-05-07T20:26:01.3093805Z 2025-05-07T20:26:01.3093810Z 2025-05-07T20:26:01.3093815Z 2025-05-07T20:26:01.3093820Z 2025-05-07T20:26:01.3093824Z 2025-05-07T20:26:01.3093829Z 2025-05-07T20:26:01.3093834Z 2025-05-07T20:26:01.3093839Z 2025-05-07T20:26:01.3093844Z 2025-05-07T20:26:01.3093849Z 2025-05-07T20:26:01.3093862Z 2025-05-07T20:26:01.3093873Z 2025-05-07T20:26:01.3093878Z 2025-05-07T20:26:01.3093883Z 2025-05-07T20:26:01.3094099Z  2025-05-07T20:26:01.3094404Z 2025-05-07T20:26:01.3094410Z 2025-05-07T20:26:01.3094415Z 2025-05-07T20:26:01.3094421Z 2025-05-07T20:26:01.3094426Z 2025-05-07T20:26:01.3094456Z 
2025-05-07T20:26:01.3094460Z 2025-05-07T20:26:01.3094465Z 2025-05-07T20:26:01.3094470Z 2025-05-07T20:26:01.3094486Z 2025-05-07T20:26:01.3094492Z 2025-05-07T20:26:01.3094497Z 2025-05-07T20:26:01.3094502Z 2025-05-07T20:26:01.3094507Z 2025-05-07T20:26:01.3094512Z 2025-05-07T20:26:01.3094517Z 2025-05-07T20:26:01.3094522Z 2025-05-07T20:26:01.3094527Z 2025-05-07T20:26:01.3094750Z  2025-05-07T20:26:01.3095053Z 2025-05-07T20:26:01.3095058Z 2025-05-07T20:26:01.3095191Z  2025-05-07T20:26:01.3095327Z 2025-05-07T20:26:01.3095332Z 2025-05-07T20:26:01.3095481Z  2025-05-07T20:26:01.3095630Z 2025-05-07T20:26:01.3095638Z 2025-05-07T20:26:01.3095642Z 2025-05-07T20:26:01.3095752Z  2025-05-07T20:26:01.3095986Z 2025-05-07T20:26:01.3095990Z 2025-05-07T20:26:01.3095994Z 2025-05-07T20:26:01.3095997Z 2025-05-07T20:26:01.3096108Z  2025-05-07T20:26:01.3096241Z 2025-05-07T20:26:01.3096245Z 2025-05-07T20:26:01.3096248Z 2025-05-07T20:26:01.3096252Z 2025-05-07T20:26:01.3096256Z 2025-05-07T20:26:01.3096365Z  2025-05-07T20:26:01.3096493Z 2025-05-07T20:26:01.3096496Z 2025-05-07T20:26:01.3096500Z 2025-05-07T20:26:01.3096503Z 2025-05-07T20:26:01.3096507Z 2025-05-07T20:26:01.3096510Z 2025-05-07T20:26:01.3096625Z  2025-05-07T20:26:01.3096781Z 2025-05-07T20:26:01.3096787Z 2025-05-07T20:26:01.3096791Z 2025-05-07T20:26:01.3096796Z 2025-05-07T20:26:01.3096801Z 2025-05-07T20:26:01.3096806Z 2025-05-07T20:26:01.3096811Z 2025-05-07T20:26:01.3096977Z  2025-05-07T20:26:01.3097175Z 2025-05-07T20:26:01.3097181Z 2025-05-07T20:26:01.3097194Z 2025-05-07T20:26:01.3097325Z 2025-05-07T20:26:01.3097329Z 2025-05-07T20:26:01.3097333Z 2025-05-07T20:26:01.3097336Z 2025-05-07T20:26:01.3097340Z 2025-05-07T20:26:01.3097471Z  2025-05-07T20:26:01.3097629Z 2025-05-07T20:26:01.3097633Z 2025-05-07T20:26:01.3097636Z 2025-05-07T20:26:01.3097640Z 2025-05-07T20:26:01.3097643Z 2025-05-07T20:26:01.3097647Z 2025-05-07T20:26:01.3097651Z 2025-05-07T20:26:01.3097654Z 2025-05-07T20:26:01.3097658Z 2025-05-07T20:26:01.3097780Z  2025-05-07T20:26:01.3097942Z 2025-05-07T20:26:01.3097946Z 2025-05-07T20:26:01.3097949Z 2025-05-07T20:26:01.3097953Z 2025-05-07T20:26:01.3097957Z 2025-05-07T20:26:01.3097960Z 2025-05-07T20:26:01.3097964Z 2025-05-07T20:26:01.3097967Z 2025-05-07T20:26:01.3097971Z 2025-05-07T20:26:01.3097975Z 2025-05-07T20:26:01.3098100Z  2025-05-07T20:26:01.3098271Z 2025-05-07T20:26:01.3098275Z 2025-05-07T20:26:01.3098278Z 2025-05-07T20:26:01.3098360Z 2025-05-07T20:26:01.3098370Z 2025-05-07T20:26:01.3098374Z 2025-05-07T20:26:01.3098377Z 2025-05-07T20:26:01.3098381Z 2025-05-07T20:26:01.3098384Z 2025-05-07T20:26:01.3098388Z 2025-05-07T20:26:01.3098391Z 2025-05-07T20:26:01.3098523Z  2025-05-07T20:26:01.3098709Z 2025-05-07T20:26:01.3098712Z 2025-05-07T20:26:01.3098716Z 2025-05-07T20:26:01.3098719Z 2025-05-07T20:26:01.3098723Z 2025-05-07T20:26:01.3098726Z 2025-05-07T20:26:01.3098730Z 2025-05-07T20:26:01.3098734Z 2025-05-07T20:26:01.3098737Z 2025-05-07T20:26:01.3098741Z 2025-05-07T20:26:01.3098744Z 2025-05-07T20:26:01.3098748Z 2025-05-07T20:26:01.3098936Z  2025-05-07T20:26:01.3099150Z 2025-05-07T20:26:01.3099153Z 2025-05-07T20:26:01.3099157Z 2025-05-07T20:26:01.3099160Z 2025-05-07T20:26:01.3099164Z 2025-05-07T20:26:01.3099168Z 2025-05-07T20:26:01.3099171Z 2025-05-07T20:26:01.3099175Z 2025-05-07T20:26:01.3099178Z 2025-05-07T20:26:01.3099182Z 2025-05-07T20:26:01.3099192Z 2025-05-07T20:26:01.3099217Z 2025-05-07T20:26:01.3099220Z 2025-05-07T20:26:01.3099362Z  2025-05-07T20:26:01.3099558Z 2025-05-07T20:26:01.3099562Z 2025-05-07T20:26:01.3099565Z 2025-05-07T20:26:01.3099575Z 
2025-05-07T20:26:01.3099765Z done
2025-05-07T20:26:01.6308361Z Preparing transaction: done
2025-05-07T20:26:05.4833164Z Verifying transaction: done
2025-05-07T20:26:06.5016311Z Executing transaction: done
2025-05-07T20:26:08.8586362Z [INSTALL] Fixing file placements for CUDA 12.8.0+ ...
2025-05-07T20:26:08.8586847Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:08.8587617Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:08.8602593Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:08.8614475Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:08.8619948Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:09.0325304Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
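The [INSTALL] steps above work around a layout change in the CUDA 12.8 conda packages: the log implies only the versioned libnvToolsExt.so.1 is shipped, and the nvtx3 headers sit under the nsight-compute tree rather than on the default include path. A minimal sketch of the same fix, assuming $CONDA_PREFIX points at the build_binary environment (the nsight-compute version glob is a hedge, not taken from this log):

    # Restore the unversioned soname link that older build scripts link against
    ln -sf "$CONDA_PREFIX/lib/libnvToolsExt.so.1" "$CONDA_PREFIX/lib/libnvToolsExt.so"
    # Surface the nvtx3 headers on the environment's default include path
    cp -r "$CONDA_PREFIX"/nsight-compute-*/host/target-linux-x64/nvtx/include/nvtx3/* \
          "$CONDA_PREFIX/include/"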
2025-05-07T20:26:09.0347224Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:09.0721538Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:10.9472576Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:11.0095787Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:11.4309783Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:11.4658620Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:11.8985513Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:11.8986596Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:14.3451645Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:16.3591631Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:18.4000862Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:18.4001907Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:20.4408119Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:22.3364026Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:22.3994133Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:26.2509540Z /tmp/tmp56xggd1p: line 3: clang: command not found
2025-05-07T20:26:26.2512617Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:26.3144393Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:26.3166004Z total 36
2025-05-07T20:26:26.3166296Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:26 .
2025-05-07T20:26:26.3166678Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:24 ..
2025-05-07T20:26:26.3167143Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:26:26.3167675Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:26:26.3168181Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:26:26.3168657Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:26.3169115Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:26.3169596Z -rw-r--r--. 2 ec2-user ec2-user 2932 Jan 24 22:22 ~cuda-nvcc_activate.sh
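Two things worth noting in the block above. First, `conda env config vars set` stores a variable inside the environment itself, so it is exported on every activation, including the implicit activation performed by `conda run`; the earlier ERROR from `conda run printenv LD_LIBRARY_PATH` is the expected result of querying the variable before it was set. Second, the paths being wired up point at driver stubs (libcuda.so, libnvidia-ml.so), which allow linking against the driver API at build time without depending on the host driver's library paths. A minimal sketch of the persistence mechanism, with SOME_VAR as a hypothetical name:

    # Persist a variable in the env; it is exported on each (re)activation
    conda env config vars set -n build_binary SOME_VAR=/some/path
    # `conda run` re-activates the env, so the value is now visible
    conda run -n build_binary printenv SOME_VAR
    # Inspect everything stored this way
    conda env config vars list -n build_binary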
2025-05-07T20:26:26.3170343Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:26.3171019Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:26.3192356Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:28.2792207Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:28.2792789Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:28.7043126Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:30.5910112Z -allow-unsupported-compiler
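The two steps above reconfigure how nvcc picks its host compiler: the sed deletes the `-ccbin=` line that the ~cuda-nvcc_activate.sh hook would otherwise inject (pinning nvcc to the conda compiler), and NVCC_PREPEND_FLAGS is an environment variable that nvcc itself reads and inserts ahead of the flags of every invocation, so `-allow-unsupported-compiler` silences the host-compiler version check when clang is used instead. A minimal sketch of the effect, where kernel.cu and the explicit clang++ ccbin are hypothetical:

    # nvcc reads NVCC_PREPEND_FLAGS and behaves as if those flags were passed first
    export NVCC_PREPEND_FLAGS='-allow-unsupported-compiler'
    # Effectively runs: nvcc -allow-unsupported-compiler -ccbin=clang++ -c kernel.cu -o kernel.o
    nvcc -ccbin=clang++ -c kernel.cu -o kernel.o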
2025-05-07T20:26:30.6535342Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:26:30.6535883Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null
2025-05-07T20:26:32.6017424Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead")))
2025-05-07T20:26:32.6018044Z #define M_PIl 3.141592653589793238462643383279502884L
2025-05-07T20:26:32.6018383Z #define _IO_CURRENTLY_PUTTING 0x800
2025-05-07T20:26:32.6018707Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig))
2025-05-07T20:26:32.6019031Z #define __DBL_MIN_EXP__ (-1021)
2025-05-07T20:26:32.6019466Z #define _STL_PAIR_H 1
2025-05-07T20:26:32.6019722Z #define __cpp_attributes 200809L
2025-05-07T20:26:32.6020115Z #define __cpp_nontype_template_parameter_auto 201606L
2025-05-07T20:26:32.6020597Z #define __DELETE_THROW throw()
2025-05-07T20:26:32.6020957Z #define _PTRDIFF_T_
2025-05-07T20:26:32.6021283Z #define M_PI_4 0.78539816339744830962
2025-05-07T20:26:32.6021674Z #define __UINT_LEAST16_MAX__ 0xffff
2025-05-07T20:26:32.6021962Z #define _IO_LEFT 02
2025-05-07T20:26:32.6022182Z #define __ATOMIC_ACQUIRE 2
2025-05-07T20:26:32.6022456Z #define _POSIX2_BC_SCALE_MAX 99
2025-05-07T20:26:32.6022729Z #define _GLIBCXX_USE_RANDOM_TR1 1
2025-05-07T20:26:32.6023166Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp)
2025-05-07T20:26:32.6023695Z #define __FLT128_MAX_10_EXP__ 4932
2025-05-07T20:26:32.6024038Z #define RE_DUP_MAX (0x7fff)
2025-05-07T20:26:32.6024297Z #define _IOS_OUTPUT 2
2025-05-07T20:26:32.6024533Z #define __SM_100_RT_HPP__
2025-05-07T20:26:32.6024855Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F
2025-05-07T20:26:32.6025330Z #define toascii_l(c,l) __toascii_l ((c), (l))
2025-05-07T20:26:32.6025753Z #define __GCC_IEC_559_COMPLEX 2
2025-05-07T20:26:32.6026121Z #define _GLIBCXX_USE_FCHMOD 1
2025-05-07T20:26:32.6026471Z #define __cpp_aggregate_nsdmi 201304L
2025-05-07T20:26:32.6037973Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; }))
2025-05-07T20:26:32.6039211Z #define __UINT_LEAST8_TYPE__ unsigned char
2025-05-07T20:26:32.6039618Z #define __SIZEOF_FLOAT80__ 16
2025-05-07T20:26:32.6039929Z #define cudaTextureTypeCubemapLayered 0xFC
2025-05-07T20:26:32.6040247Z #define _T_WCHAR_
2025-05-07T20:26:32.6040466Z #define stdout stdout
2025-05-07T20:26:32.6040804Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11")))
2025-05-07T20:26:32.6041201Z #define CHAR_BIT __CHAR_BIT__
2025-05-07T20:26:32.6041460Z #define __flexarr []
2025-05-07T20:26:32.6041697Z #define _GLIBCXX_HAVE_FINITEF 1
2025-05-07T20:26:32.6042064Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l))
2025-05-07T20:26:32.6042553Z #define _IO_FLAGS2_USER_WBUF 8
2025-05-07T20:26:32.6042898Z #define _MATH_H 1
2025-05-07T20:26:32.6043271Z #define cudaOccupancyDisableCachingOverride 0x01
2025-05-07T20:26:32.6043980Z #define __S64_TYPE long int
2025-05-07T20:26:32.6044316Z #define __stub_fchflags
2025-05-07T20:26:32.6044676Z #define cudaDeviceScheduleMask 0x07
2025-05-07T20:26:32.6045080Z #define __SQUAD_TYPE long int
2025-05-07T20:26:32.6045740Z #define __INTMAX_C(c) c ## L
2025-05-07T20:26:32.6046158Z #define cudaStreamFireAndForget ((cudaStream_t)0x4)
2025-05-07T20:26:32.6046625Z #define _BSD_SIZE_T_DEFINED_
2025-05-07T20:26:32.6046974Z #define NL_NMAX INT_MAX
2025-05-07T20:26:32.6047282Z #define _BITS_TIME_H 1
2025-05-07T20:26:32.6047662Z #define M_LN10l 2.302585092994045684017991454684364208L
2025-05-07T20:26:32.6048111Z #define _GLIBCXX_TXN_SAFE_DYN
2025-05-07T20:26:32.6048516Z #define cudaStreamTailLaunch ((cudaStream_t)0x3)
2025-05-07T20:26:32.6048975Z #define M_El 2.718281828459045235360287471352662498L
2025-05-07T20:26:32.6049387Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd)
2025-05-07T20:26:32.6049931Z #define __CHAR_BIT__ 8
2025-05-07T20:26:32.6050197Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE
2025-05-07T20:26:32.6050530Z #define _PSTL_STRING_CONCAT(x,y) x #y
2025-05-07T20:26:32.6050820Z #define _GLIBCXX98_USE_C99_MATH 1
2025-05-07T20:26:32.6051088Z #define FP_NAN 0
2025-05-07T20:26:32.6051347Z #define makedev(maj,min) gnu_dev_makedev (maj, min)
2025-05-07T20:26:32.6051767Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2
2025-05-07T20:26:32.6052155Z #define __cudaCDP2GetErrorString
2025-05-07T20:26:32.6052442Z #define SHRT_MAX __SHRT_MAX__
2025-05-07T20:26:32.6052704Z #define _GLIBCXX_X86_RDSEED 1
2025-05-07T20:26:32.6052950Z #define __SM_80_RT_H__
2025-05-07T20:26:32.6053172Z #define _NEW
2025-05-07T20:26:32.6053393Z #define CLOCK_PROCESS_CPUTIME_ID 2
2025-05-07T20:26:32.6053667Z #define __UINT8_MAX__ 0xff
2025-05-07T20:26:32.6054049Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition)
2025-05-07T20:26:32.6054476Z #define __SCHAR_WIDTH__ 8
2025-05-07T20:26:32.6054724Z #define __USE_ANSI 1
2025-05-07T20:26:32.6055020Z #define _IO_BE(expr,res) __builtin_expect ((expr), res)
2025-05-07T20:26:32.6055434Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l))
2025-05-07T20:26:32.6055806Z #define __cudaCDP2Memcpy2DAsync_ptsz
2025-05-07T20:26:32.6056106Z #define __WINT_MAX__ 0xffffffffU
2025-05-07T20:26:32.6056389Z #define __SIZEOF_PTHREAD_ATTR_T 56
2025-05-07T20:26:32.6056717Z #define __FLT32_MIN_EXP__ (-125)
2025-05-07T20:26:32.6057008Z #define _GLIBCXX_END_NAMESPACE_LDBL
2025-05-07T20:26:32.6057301Z #define PIPE_BUF 4096
2025-05-07T20:26:32.6057627Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2)
2025-05-07T20:26:32.6058099Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11
2025-05-07T20:26:32.6058492Z #define ADJ_TICK 0x4000
2025-05-07T20:26:32.6058779Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10)
2025-05-07T20:26:32.6059104Z #define MQ_PRIO_MAX 32768
2025-05-07T20:26:32.6059373Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4
2025-05-07T20:26:32.6059700Z #define __WAIT_INT(status) (*(int *) &(status))
2025-05-07T20:26:32.6060298Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min))
2025-05-07T20:26:32.6060836Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01
2025-05-07T20:26:32.6061207Z #define _XOPEN_SOURCE 700
2025-05-07T20:26:32.6061461Z #define _POSIX2_BC_DIM_MAX 2048
2025-05-07T20:26:32.6061728Z #define __VECTOR_FUNCTIONS_HPP__
2025-05-07T20:26:32.6062010Z #define __cpp_static_assert 201411L
2025-05-07T20:26:32.6062293Z #define __GLIBCXX__ 20230528
2025-05-07T20:26:32.6062549Z #define _GLIBCXX_HAVE_STRXFRM_L 1
2025-05-07T20:26:32.6062824Z #define _POSIX_TTY_NAME_MAX 9
2025-05-07T20:26:32.6063106Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__
2025-05-07T20:26:32.6063402Z #define __OFF_T_MATCHES_OFF64_T 1
2025-05-07T20:26:32.6063691Z #define __ORDER_LITTLE_ENDIAN__ 1234
2025-05-07T20:26:32.6063997Z #define __SIZE_MAX__ 0xffffffffffffffffUL
2025-05-07T20:26:32.6064358Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l))
2025-05-07T20:26:32.6064720Z #define __WCHAR_MAX__ 0x7fffffff
2025-05-07T20:26:32.6065095Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1
2025-05-07T20:26:32.6065411Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE
2025-05-07T20:26:32.6065772Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l))
2025-05-07T20:26:32.6066132Z #define cudaNvSciSyncAttrSignal 0x1
2025-05-07T20:26:32.6066429Z #define _GLIBCXX_USE_LONG_LONG 1
2025-05-07T20:26:32.6066716Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1
2025-05-07T20:26:32.6067045Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1
2025-05-07T20:26:32.6067374Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1
2025-05-07T20:26:32.6067785Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L)
2025-05-07T20:26:32.6068210Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1
2025-05-07T20:26:32.6068518Z #define ADJ_ESTERROR 0x0008
2025-05-07T20:26:32.6068789Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2
2025-05-07T20:26:32.6069066Z #define __GCC_IEC_559 2
2025-05-07T20:26:32.6069358Z #define __cpp_lib_transformation_trait_aliases 201304
2025-05-07T20:26:32.6069878Z #define _IO_flockfile(_fp)
2025-05-07T20:26:32.6070147Z #define CLOCK_MONOTONIC_RAW 4
2025-05-07T20:26:32.6070415Z #define __FLT32X_DECIMAL_DIG__ 17
2025-05-07T20:26:32.6070683Z #define _IOFBF 0
2025-05-07T20:26:32.6070891Z #define __USE_BSD 1
2025-05-07T20:26:32.6071114Z #define __FLT_EVAL_METHOD__ 0
2025-05-07T20:26:32.6071394Z #define SHRT_MIN (-SHRT_MAX - 1)
2025-05-07T20:26:32.6071665Z #define _IO_USER_LOCK 0x8000
2025-05-07T20:26:32.6071921Z #define _IO_NO_WRITES 8
2025-05-07T20:26:32.6072183Z #define _GLIBCXX_PSEUDO_VISIBILITY(V)
2025-05-07T20:26:32.6072544Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname
2025-05-07T20:26:32.6072909Z #define _GLIBCXX_HAVE_SYS_STAT_H 1
2025-05-07T20:26:32.6073225Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ())
2025-05-07T20:26:32.6073558Z #define __cpp_binary_literals 201304L
2025-05-07T20:26:32.6073854Z #define _CPP_TYPE_TRAITS_H 1
2025-05-07T20:26:32.6074128Z #define __BEGIN_NAMESPACE_C99
2025-05-07T20:26:32.6074425Z #define __FLT64_DECIMAL_DIG__ 17
2025-05-07T20:26:32.6074734Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A)
2025-05-07T20:26:32.6075130Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE)
2025-05-07T20:26:32.6075501Z #define __cpp_noexcept_function_type 201510L
2025-05-07T20:26:32.6075803Z #define M_PI 3.14159265358979323846
2025-05-07T20:26:32.6076115Z #define _GLIBCXX_PACKAGE_NAME "package-unused"
2025-05-07T20:26:32.6076443Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1
2025-05-07T20:26:32.6076744Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2
2025-05-07T20:26:32.6077051Z
#define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:32.6077327Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:32.6077593Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:32.6078436Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:32.6079053Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:32.6079484Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:32.6079808Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:32.6080120Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:32.6080400Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:32.6080667Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:32.6080983Z #define __ASSERT_VOID_CAST static_cast<void> 2025-05-07T20:26:32.6081321Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:32.6081618Z #define RAND_MAX 2147483647 2025-05-07T20:26:32.6081888Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:32.6082221Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:32.6082545Z #define __SM_90_RT_H__ 2025-05-07T20:26:32.6083916Z nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:26:32.6084671Z 2025-05-07T20:26:32.6084770Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:32.6085336Z #define __COMPAR_FN_T 2025-05-07T20:26:32.6085573Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:32.6085835Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:32.6086326Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:32.6086855Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:32.6087198Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:32.6087573Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:32.6087881Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:32.6088223Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:32.6088547Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:32.6089079Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:32.6089650Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:32.6089987Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:32.6090276Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:32.6090595Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:32.6090905Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:32.6091181Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:32.6091458Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:32.6091722Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:32.6091976Z #define __u_char_defined 2025-05-07T20:26:32.6092327Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:32.6092701Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:32.6092960Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:32.6093222Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:32.6093511Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:32.6093963Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:32.6094407Z #define FP_INFINITE 1 2025-05-07T20:26:32.6094787Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:32.6095319Z #define _IO_pid_t __pid_t 2025-05-07T20:26:32.6095685Z #define __UINT_FAST8_MAX__ 0xff
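The wait-status macros scattered through this dump (WIFEXITED just above; WEXITSTATUS, __WTERMSIG, __WIFSIGNALED, and __W_STOPCODE appear further down) all carve a single int into bit fields: bits 0-6 carry the terminating signal, bit 7 the core-dump flag, and bits 8-15 the exit code. A minimal standalone C sketch of that layout follows; the status values are illustrative, not taken from this job:

    #include <stdio.h>

    /* Mirrors the glibc layout seen in the dump: __WTERMSIG is
     * (status & 0x7f), __WEXITSTATUS is ((status & 0xff00) >> 8),
     * and bit 7 is the core-dump flag. */
    static void decode_wait_status(int status) {
        int termsig = status & 0x7f;
        int exitcode = (status & 0xff00) >> 8;
        if (termsig == 0)
            printf("exited with code %d\n", exitcode);
        else
            printf("killed by signal %d%s\n", termsig,
                   (status & 0x80) ? " (core dumped)" : "");
    }

    int main(void) {
        decode_wait_status(0x0100); /* exit(1) -> exited with code 1 */
        decode_wait_status(0x0009); /* SIGKILL -> killed by signal 9 */
        return 0;
    }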
2025-05-07T20:26:32.6096041Z #define __LEAF , __leaf__ 2025-05-07T20:26:32.6096356Z #define PATH_MAX 4096 2025-05-07T20:26:32.6096689Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:32.6097152Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:32.6097501Z #define _LIMITS_H___ 2025-05-07T20:26:32.6097722Z #define __size_t 2025-05-07T20:26:32.6098012Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:32.6098746Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:32.6099578Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:32.6099953Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:32.6100294Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:32.6100554Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:32.6100921Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:32.6101552Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:32.6101856Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:32.6102178Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:32.6102462Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:32.6102743Z #define __INT8_C(c) c 2025-05-07T20:26:32.6102993Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:32.6103292Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:32.6103556Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:32.6103808Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:32.6104061Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:32.6104342Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:32.6104962Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:32.6105289Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:32.6105561Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:32.6105830Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:32.6106089Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:32.6106411Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:32.6106851Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:32.6107214Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:32.6107602Z #define NFDBITS __NFDBITS 2025-05-07T20:26:32.6107861Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:32.6108274Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:32.6108601Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:32.6108922Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:32.6109173Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:32.6109464Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:32.6109932Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:32.6110247Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:32.6110679Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:32.6111048Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:32.6111339Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:32.6111665Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:32.6111996Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:32.6112321Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:32.6112657Z #define __daddr_t_defined 2025-05-07T20:26:32.6112909Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:32.6113184Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:32.6113498Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:32.6114036Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 
20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:32.6114551Z #define _ACRTIMP 2025-05-07T20:26:32.6114772Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:32.6115046Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:32.6115345Z #define _IOS_BIN 128 2025-05-07T20:26:32.6115704Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:32.6116129Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:32.6116418Z #define UNDERFLOW 4 2025-05-07T20:26:32.6116637Z #define NAME_MAX 255 2025-05-07T20:26:32.6116871Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:32.6117146Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:32.6117430Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:32.6117721Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:32.6118111Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:32.6118516Z #define __ptr_t void * 2025-05-07T20:26:32.6118758Z #define M_E 2.7182818284590452354 2025-05-07T20:26:32.6119041Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:32.6119311Z #define __USE_ISOCXX11 1 2025-05-07T20:26:32.6119581Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:32.6119901Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:32.6120202Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:32.6120486Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:32.6120771Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:32.6121248Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:32.6121512Z #define __linux 1 2025-05-07T20:26:32.6121727Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:32.6121998Z #define cudaDeviceMask 0xff 2025-05-07T20:26:32.6122264Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:32.6122549Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:32.6122832Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:32.6123133Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:32.6123441Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:32.6123753Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:32.6124054Z #define _BITS_TYPES_H 1 2025-05-07T20:26:32.6124349Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:32.6124694Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:32.6125002Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:32.6125289Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:32.6125576Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:32.6125965Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:32.6126838Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:32.6127691Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:32.6127980Z #define __unix 1 2025-05-07T20:26:32.6128192Z #define MATH_ERRNO 1 2025-05-07T20:26:32.6128424Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:32.6128704Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:32.6128968Z #define __SM_100_RT_H__ 2025-05-07T20:26:32.6129215Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:32.6129496Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:32.6129784Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:32.6130058Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:32.6130355Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:32.6130839Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 
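Two of the macros above reduce a dotted version to one integer so that a single comparison orders releases: __CUDART_API_VERSION packs major * 1000 + minor * 10, and __GLIBC_PREREQ (near the start of this dump) packs (major << 16) + minor. A small sketch with assumed values (a CUDA minor of 8, matching the 12.8.0 toolkit in this job's name; glibc 2.26 is purely illustrative):

    #include <stdio.h>

    int main(void) {
        /* __CUDART_API_VERSION: major * 1000 + minor * 10; the minor
         * value 8 is assumed from the 12.8.0 toolkit, not read here. */
        int cuda_api = 12 * 1000 + 8 * 10;
        printf("CUDART_API_VERSION = %d\n", cuda_api); /* 12080 */

        /* __GLIBC_PREREQ packs (major << 16) + minor, so ordering
         * versions is one integer compare. 2.26 vs. a required 2.17
         * is an illustrative pair, not taken from this build. */
        int have = (2 << 16) + 26;
        int need = (2 << 16) + 17;
        printf("glibc >= 2.17: %d\n", have >= need); /* prints 1 */
        return 0;
    }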
2025-05-07T20:26:32.6131328Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:32.6131623Z #define CUDARTAPI_CDECL 2025-05-07T20:26:32.6131882Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:32.6132157Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:32.6132445Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:32.6132716Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:32.6132956Z #define __SIZE_T 2025-05-07T20:26:32.6133208Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:32.6133530Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:26:32.6133834Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:32.6134102Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:32.6134367Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:32.6134634Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:32.6135036Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:32.6135481Z #define __WAIT_STATUS void * 2025-05-07T20:26:32.6135759Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:32.6136029Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:32.6136295Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:32.6136585Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:32.6136868Z #define __WINT_MIN__ 0U 2025-05-07T20:26:32.6137474Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:32.6138363Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:32.6138673Z #define WUNTRACED 2 2025-05-07T20:26:32.6138903Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:32.6139171Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:32.6139459Z #define NZERO 20 2025-05-07T20:26:32.6139684Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:32.6139976Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:32.6140373Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:32.6140728Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:32.6141097Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:32.6141392Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:32.6141749Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:32.6142025Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:32.6142300Z #define EXIT_FAILURE 1 2025-05-07T20:26:32.6142538Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:32.6142798Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:32.6143060Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:32.6143315Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:32.6143596Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:32.6143930Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:32.6144293Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:32.6144591Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:32.6144840Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:32.6145121Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:32.6145422Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:32.6145738Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:32.6147729Z #define SEEK_DATA 3 2025-05-07T20:26:32.6147958Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:32.6148247Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:32.6148682Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:32.6149082Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:32.6149333Z #define __INT64_C(c) c ## L 2025-05-07T20:26:32.6149596Z #define __NTH(fct) __LEAF_ATTR fct throw () 
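Several limits in this dump are derived rather than spelled out: SCHAR_MIN above is (-SCHAR_MAX - 1) because two's complement has one more negative value than positive, and ULONG_LONG_MAX (earlier in the dump) is (LONG_LONG_MAX * 2ULL + 1ULL), all bits set, computed in unsigned arithmetic so nothing overflows. A short sketch verifying both identities:

    #include <stdio.h>
    #include <limits.h>

    int main(void) {
        /* Two's complement: MIN has one more magnitude step than MAX. */
        printf("SCHAR_MAX = %d, derived MIN = %d\n",
               SCHAR_MAX, -SCHAR_MAX - 1); /* 127, -128 */

        /* Unsigned max is all bits set: twice the signed max plus one.
         * The 2ULL operand forces unsigned arithmetic, so the multiply
         * cannot overflow. */
        unsigned long long ull_max = LLONG_MAX * 2ULL + 1ULL;
        printf("ULONG_LONG_MAX = %llu\n", ull_max);
        return 0;
    }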
2025-05-07T20:26:32.6150065Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:32.6150398Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:32.6150667Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:32.6150970Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:32.6151282Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:32.6151541Z #define __INT_WCHAR_T_H 2025-05-07T20:26:32.6151791Z #define WSTOPPED 2 2025-05-07T20:26:32.6152029Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:32.6152330Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:32.6152588Z #define FP_NORMAL 4 2025-05-07T20:26:32.6152832Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:32.6153116Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:32.6153360Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:32.6153628Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:32.6153924Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:32.6154199Z #define cudaTextureType1D 0x01 2025-05-07T20:26:32.6154474Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:32.6154743Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:32.6155015Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:32.6155321Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:32.6155767Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:32.6156253Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:32.6156523Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:32.6156791Z #define _POSIX_SOURCE 1 2025-05-07T20:26:32.6157051Z #define cudaTextureType2D 0x02 2025-05-07T20:26:32.6157317Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:32.6157592Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:32.6157917Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:32.6158186Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:32.6158516Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:32.6158865Z #define cudaTextureType3D 0x03 2025-05-07T20:26:32.6159136Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:32.6159402Z #define CLOCK_REALTIME 0 2025-05-07T20:26:32.6159655Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:32.6159929Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:32.6160244Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:32.6160529Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:32.6160806Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:32.6161099Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:32.6161377Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:32.6161772Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:32.6162075Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:32.6162354Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:32.6174707Z #define __GLIBC__ 2 2025-05-07T20:26:32.6174997Z #define __END_DECLS } 2025-05-07T20:26:32.6175248Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:32.6175632Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:32.6176030Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:32.6176281Z #define WCONTINUED 8 2025-05-07T20:26:32.6176505Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:32.6176753Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:32.6177020Z #define _ALLOCA_H 1 2025-05-07T20:26:32.6177238Z #define __host__ __location__(host) 2025-05-07T20:26:32.6177666Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:32.6178113Z #define __SLONG32_TYPE int 2025-05-07T20:26:32.6178370Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 
1 2025-05-07T20:26:32.6178902Z #define _SYS_SELECT_H 1 2025-05-07T20:26:32.6179135Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:32.6179378Z #define _IOS_NOCREATE 32 2025-05-07T20:26:32.6179623Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:32.6179895Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:32.6180186Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:32.6180465Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:32.6180751Z #define __global__ __location__(global) 2025-05-07T20:26:32.6181035Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:32.6181282Z #define __cpp_decltype_auto 201304L 2025-05-07T20:26:32.6181559Z #define __DBL_DIG__ 15 2025-05-07T20:26:32.6181778Z #define TIME_UTC 1 2025-05-07T20:26:32.6181995Z #define __FLT32_DIG__ 6 2025-05-07T20:26:32.6182321Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:32.6182729Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:32.6183426Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:32.6183756Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:32.6184069Z #define _G_BUFSIZ 8192 2025-05-07T20:26:32.6184381Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:32.6184755Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:32.6185060Z #define __cudaCDP2GetDevice 2025-05-07T20:26:32.6185347Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:32.6185638Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:32.6185888Z #define __GXX_WEAK__ 1 2025-05-07T20:26:32.6186146Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:32.6186455Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:32.6186712Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:32.6187001Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:32.6187338Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:32.6187614Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:32.6187906Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:32.6188203Z #define _G_config_h 1 2025-05-07T20:26:32.6188485Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:32.6188834Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:32.6189107Z #define _GCC_WCHAR_T 2025-05-07T20:26:32.6189343Z #define TMP_MAX 238328 2025-05-07T20:26:32.6189585Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:32.6189971Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:32.6190236Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:32.6190518Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:32.6190793Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:32.6191090Z #define _IO_SKIPWS 01 2025-05-07T20:26:32.6191509Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:32.6191993Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:32.6192256Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:32.6192599Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:32.6192982Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:32.6193611Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:32.6193993Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:32.6194247Z #define le32toh(x) (x) 2025-05-07T20:26:32.6194473Z #define _SIZE_T_DEFINED 2025-05-07T20:26:32.6194723Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:32.6195061Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:32.6195409Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:32.6195821Z #define __WIFSIGNALED(status) 
(((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:32.6196251Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:32.6196513Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:32.6196768Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:32.6197029Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:32.6197304Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:32.6197842Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:32.6198365Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:32.6198814Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:32.6199422Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:32.6199749Z #define _WCHAR_T_ 2025-05-07T20:26:32.6199974Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:32.6200353Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:32.6200747Z #define RTSIG_MAX 32 2025-05-07T20:26:32.6200971Z #define _STDDEF_H 2025-05-07T20:26:32.6201204Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:32.6201468Z #define _VA_LIST_DEFINED 2025-05-07T20:26:32.6201719Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:32.6202055Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:32.6202445Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:32.6202781Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:32.6203074Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:32.6203553Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:32.6204108Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:32.6204493Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:32.6204821Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:32.6205142Z #define __unix__ 1 2025-05-07T20:26:32.6205379Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:32.6205667Z #define __INT_WIDTH__ 32 2025-05-07T20:26:32.6205909Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:32.6206153Z #define _IONBF 2 2025-05-07T20:26:32.6206615Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:32.6207422Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:32.6207989Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:32.6208247Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:32.6208524Z #define __UINT16_C(c) c 2025-05-07T20:26:32.6208773Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:32.6209051Z #define STA_DEL 0x0020 2025-05-07T20:26:32.6209297Z #define __CUDACC_VER_MINOR__ 8 2025-05-07T20:26:32.6209553Z #define __id_t_defined 2025-05-07T20:26:32.6209828Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:32.6210302Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:32.6210747Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:32.6211023Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:32.6211290Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:32.6211547Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:32.6211819Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:32.6212090Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:32.6212356Z #define SING 2 2025-05-07T20:26:32.6212575Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:32.6212848Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:32.6213159Z #define cudaStreamDefault 0x00 2025-05-07T20:26:32.6213633Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:32.6214022Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:32.6214295Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:32.6214558Z #define __gnu_linux__ 1 2025-05-07T20:26:32.6214795Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:32.6215051Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:32.6215352Z #define MAX_INPUT 255 2025-05-07T20:26:32.6215601Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:32.6215934Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:32.6216314Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:32.6216674Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:32.6216951Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:32.6217361Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:32.6217797Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:32.6218131Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:32.6218606Z #define _Mfloat_ float 2025-05-07T20:26:32.6218863Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:32.6219177Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:32.6219472Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:32.6219790Z #define cudaMemPoolCreateUsageHwDecompress 0x2 2025-05-07T20:26:32.6220351Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:32.6220874Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:32.6221161Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:32.6221491Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:32.6221861Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:32.6222169Z #define __USE_ISOC11 1 2025-05-07T20:26:32.6222396Z #define _BSD_SIZE_T_ 2025-05-07T20:26:32.6222632Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:32.6222887Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:32.6223162Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:32.6223467Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:32.6223798Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:32.6224107Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:32.6224452Z #define __THROW throw () 2025-05-07T20:26:32.6224712Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:32.6225007Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:32.6225371Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:32.6225740Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:32.6226024Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:32.6226555Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:32.6226824Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:32.6227086Z #define L_tmpnam 20 2025-05-07T20:26:32.6227301Z #define ___int_wchar_t_h 2025-05-07T20:26:32.6227648Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:32.6228046Z #define isascii(c) __isascii (c) 2025-05-07T20:26:32.6228313Z #define _T_PTRDIFF 2025-05-07T20:26:32.6228630Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:32.6229002Z #define toascii(c) __toascii (c) 2025-05-07T20:26:32.6229254Z #define __GNUC__ 11 2025-05-07T20:26:32.6229504Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:32.6230081Z #define __GXX_RTTI 1 2025-05-07T20:26:32.6230303Z #define __pie__ 2 2025-05-07T20:26:32.6230517Z #define __MMX__ 1 2025-05-07T20:26:32.6230738Z #define __cudaCDP2Malloc 2025-05-07T20:26:32.6230997Z #define __timespec_defined 1 2025-05-07T20:26:32.6231242Z #define L_ctermid 9 2025-05-07T20:26:32.6231474Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:32.6231783Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:32.6232176Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:32.6232565Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:32.6232838Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:32.6233231Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:32.6233550Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:32.6233876Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:32.6234142Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:32.6234611Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:32.6235411Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:32.6236059Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:32.6236371Z #define __USE_SVID 1 2025-05-07T20:26:32.6236628Z #define __constant__ __location__(constant) 2025-05-07T20:26:32.6236969Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:32.6237282Z #define __device__ __location__(device) 2025-05-07T20:26:32.6237614Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:32.6237953Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:32.6238239Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:32.6238600Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:32.6238955Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:32.6239333Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:32.6239615Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:32.6239999Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:32.6240403Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:32.6240654Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:32.6241038Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:32.6241489Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:32.6241821Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:32.6242093Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:32.6242457Z #define NGROUPS_MAX 65536 2025-05-07T20:26:32.6242789Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:32.6243061Z #define __USE_ISOC95 1 2025-05-07T20:26:32.6243290Z #define _TIME_H 1 2025-05-07T20:26:32.6243651Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:32.6243978Z #define __USE_ISOC99 1 2025-05-07T20:26:32.6244308Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:32.6244687Z #define HOST_NAME_MAX 64 2025-05-07T20:26:32.6244934Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:32.6245197Z #define _IOS_ATEND 4 2025-05-07T20:26:32.6245432Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:32.6245753Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:32.6246164Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:32.6246514Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:32.6246800Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:32.6247117Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:32.6247438Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:32.6247697Z #define _STDIO_H 1 2025-05-07T20:26:32.6248109Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:32.6248603Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:32.6248974Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:32.6249354Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:32.6249650Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:32.6249926Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:32.6250196Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:32.6250486Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:32.6250791Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6251113Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:32.6251382Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:32.6251666Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:32.6251978Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:32.6252247Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:32.6252540Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:32.6253099Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:32.6253487Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:32.6253735Z #define __USE_XOPEN 1 2025-05-07T20:26:32.6253980Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:32.6254431Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:32.6254891Z #define __USE_XOPEN2K 1 2025-05-07T20:26:32.6255134Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:32.6255402Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:32.6255699Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:32.6255972Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:32.6256518Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:32.6257064Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:32.6257355Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:32.6257722Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:32.6258246Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:32.6258632Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:32.6259039Z #define __END_NAMESPACE_C99 2025-05-07T20:26:32.6259312Z #define __glibcxx_integral_traps true 2025-05-07T20:26:32.6259595Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:32.6260130Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:32.6260390Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:32.6260650Z #define _IOS_TRUNC 16 2025-05-07T20:26:32.6260879Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:32.6261131Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:32.6261419Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:32.6261721Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:32.6262098Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:32.6262488Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:32.6262768Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:32.6263032Z #define _IO_UNITBUF 020000 2025-05-07T20:26:32.6263293Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:32.6263556Z #define __FD_SETSIZE 1024 2025-05-07T20:26:32.6263811Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:32.6264088Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:32.6264432Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:32.6264798Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:32.6265068Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:32.6265376Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:32.6265703Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:32.6265978Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:32.6266282Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:32.6266634Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:32.6266928Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:32.6267263Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:32.6267563Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:32.6267845Z #define __USE_POSIX199506 1 2025-05-07T20:26:32.6268111Z #define _FEATURES_H 1 2025-05-07T20:26:32.6268352Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:32.6268766Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:32.6269270Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:32.6269616Z #define 
__stub_getmsg 2025-05-07T20:26:32.6270008Z #define _IO_FIXED 010000 2025-05-07T20:26:32.6270288Z #define __cpp_lib_addressof_constexpr 201603 2025-05-07T20:26:32.6270606Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:32.6270884Z #define __stub_setlogin 2025-05-07T20:26:32.6271123Z #define __stub_fattach 2025-05-07T20:26:32.6271359Z #define __cplusplus 201703L 2025-05-07T20:26:32.6271631Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:32.6271920Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:32.6272173Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:32.6272457Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:32.6273073Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:32.6273639Z #define _IO_INTERNAL 010 2025-05-07T20:26:32.6273882Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:32.6274221Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:32.6274590Z #define __dev_t_defined 2025-05-07T20:26:32.6274820Z #define __DEPRECATED 1 2025-05-07T20:26:32.6275048Z #define __S32_TYPE int 2025-05-07T20:26:32.6275295Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:32.6275588Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:32.6275848Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:32.6276102Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:32.6276742Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:32.6277415Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:32.6277732Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:32.6278085Z #define OVERFLOW 3 2025-05-07T20:26:32.6278420Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:32.6278735Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:32.6279023Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:32.6279359Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:32.6279697Z #define __SSE2_MATH__ 1 2025-05-07T20:26:32.6279949Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:32.6280255Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:32.6280561Z #define _IO_STDIO_H 2025-05-07T20:26:32.6280808Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:32.6281096Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:32.6281424Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:32.6281724Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6282036Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:32.6282305Z #define __amd64 1 2025-05-07T20:26:32.6282529Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:32.6283049Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:32.6283385Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:32.6283680Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:32.6283994Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:32.6284256Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:32.6284560Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:32.6284830Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:32.6285078Z #define __bounded 2025-05-07T20:26:32.6285302Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:26:32.6285578Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:32.6285866Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:32.6286392Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:32.6286667Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:32.6286941Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:32.6287271Z #define __W_STOPCODE(sig) ((sig) 
<< 8 | 0x7f) 2025-05-07T20:26:32.6287698Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:32.6288114Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:32.6288394Z #define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:32.6288740Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:32.6289099Z #define STA_PLL 0x0001 2025-05-07T20:26:32.6289470Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:32.6289742Z #define __GNUG__ 11 2025-05-07T20:26:32.6289973Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:32.6290230Z #define _T_WCHAR 2025-05-07T20:26:32.6290466Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:32.6290756Z #define __specialization_static 2025-05-07T20:26:32.6291056Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:32.6291372Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:32.6291635Z #define cudaArraySparse 0x40 2025-05-07T20:26:32.6291893Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:32.6292176Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:32.6292484Z #define _WCHAR_T 2025-05-07T20:26:32.6292707Z #define __cudaCDP2Free 2025-05-07T20:26:32.6293620Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:32.6294371Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:32.6294806Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:32.6295266Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:32.6295550Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:32.6295814Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:32.6296153Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:32.6296506Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:32.6296752Z #define __NO_CTYPE 1 2025-05-07T20:26:32.6296979Z #define __stub_bdflush 2025-05-07T20:26:32.6297361Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:32.6297801Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:32.6298113Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:32.6298508Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:32.6298785Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:32.6299096Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:32.6299410Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:32.6307942Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:32.6308320Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:32.6308614Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:32.6308912Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:32.6309273Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:32.6309639Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:32.6310133Z #define _IO_STDIO 040000 2025-05-07T20:26:32.6310472Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:32.6310877Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:32.6311195Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:32.6311505Z #define _PTRDIFF_T 2025-05-07T20:26:32.6311735Z #define _MOVE_H 1 2025-05-07T20:26:32.6311964Z #define __cpp_hex_float 201603L 2025-05-07T20:26:32.6312238Z #define ADJ_TAI 0x0080 2025-05-07T20:26:32.6312467Z #define __ptrvalue 2025-05-07T20:26:32.6312701Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:32.6312966Z 
#define __GXX_ABI_VERSION 1016 2025-05-07T20:26:32.6313259Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:32.6313564Z #define MATH_ERREXCEPT 2 2025-05-07T20:26:32.6313826Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:32.6314115Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:32.6314524Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:32.6314933Z #define __USE_GNU 1 2025-05-07T20:26:32.6315174Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:32.6315454Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:32.6315732Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:32.6316141Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:32.6316549Z #define WEXITED 4 2025-05-07T20:26:32.6316768Z #define _IO_NO_READS 4 2025-05-07T20:26:32.6317071Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:32.6317430Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:32.6317715Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:32.6318026Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:32.6318352Z #define __uid_t_defined 2025-05-07T20:26:32.6318886Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:32.6319185Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:32.6319458Z #define WNOHANG 1 2025-05-07T20:26:32.6319706Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:32.6320021Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:32.6320295Z #define cudaEventDefault 0x00 2025-05-07T20:26:32.6320594Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:32.6320922Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:32.6321165Z #define __x86_64 1 2025-05-07T20:26:32.6321566Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:32.6321975Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:32.6322478Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:32.6322992Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:32.6323445Z #define __PTRDIFF_T 2025-05-07T20:26:32.6323771Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:32.6324154Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:32.6324431Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:32.6324728Z #define _Mlong_double_ long double 2025-05-07T20:26:32.6325015Z #define __cpp_lambdas 200907L 2025-05-07T20:26:32.6325262Z #define _IO_DEC 020 2025-05-07T20:26:32.6325487Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:32.6325759Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:32.6326048Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:32.6326434Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:32.6326697Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:32.6326991Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:32.6327321Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:32.6327598Z #define _ANSI_STDDEF_H 2025-05-07T20:26:32.6327872Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:32.6328191Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:32.6328574Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:32.6328978Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:32.6329258Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:32.6329555Z #define __cpp_template_auto 201606L 2025-05-07T20:26:32.6329923Z #define __DBL_MIN__ 
double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:32.6330302Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:32.6330576Z #define __key_t_defined 2025-05-07T20:26:32.6330827Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:32.6331210Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:32.6331701Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:32.6332089Z #define __GNUC_VA_LIST 2025-05-07T20:26:32.6332426Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:32.6332830Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:32.6333102Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:32.6333385Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:32.6333685Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:32.6333940Z #define __WCOREFLAG 0x80 2025-05-07T20:26:32.6334197Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:32.6334508Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:32.6334796Z #define __LP64__ 1 2025-05-07T20:26:32.6335044Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:32.6335365Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:32.6335659Z #define _IO_off64_t __off64_t 2025-05-07T20:26:32.6335937Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:32.6336194Z #define __time_t_defined 1 2025-05-07T20:26:32.6336456Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:32.6336867Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:32.6337263Z #define __USE_UNIX98 1 2025-05-07T20:26:32.6337506Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:32.6337782Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:32.6338056Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:32.6338353Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:32.6338672Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:32.6338936Z #define SEEK_CUR 1 2025-05-07T20:26:32.6339162Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:32.6339440Z #define _ASSERT_H 1 2025-05-07T20:26:32.6340042Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:32.6340858Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:32.6341138Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:32.6341399Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:32.6341665Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:32.6341938Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:32.6342324Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:32.6342751Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:32.6343438Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:32.6344142Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:32.6344545Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:32.6344963Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:32.6345401Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:32.6345717Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:32.6346124Z #define cudaArrayDefault 0x00 2025-05-07T20:26:32.6346708Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:32.6347017Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:32.6347310Z #define TLOSS 5 2025-05-07T20:26:32.6347525Z #define __ssize_t_defined 2025-05-07T20:26:32.6347787Z #define __CUDACC_VER_BUILD__ 61 2025-05-07T20:26:32.6348069Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:32.6348369Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:32.6348655Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:32.6349087Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:32.6349375Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:32.6349819Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:32.6350124Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:32.6350418Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:32.6350714Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:32.6350976Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:32.6351323Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:32.6351701Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:32.6351946Z #define __cdecl 2025-05-07T20:26:32.6352188Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:32.6352522Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:32.6352865Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:32.6353124Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:32.6353394Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:32.6353696Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:32.6353970Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:32.6354281Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:32.6354622Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:32.6355042Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:32.6355496Z #define ADJ_NANO 0x2000 2025-05-07T20:26:32.6355801Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:32.6356184Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:32.6356475Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:32.6356734Z #define __FLT_DIG__ 6 2025-05-07T20:26:32.6357095Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:32.6357510Z #define __NO_INLINE__ 1 2025-05-07T20:26:32.6357810Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:32.6358175Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:32.6358436Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:32.6358697Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:32.6358994Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:32.6359268Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:32.6359573Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:32.6359862Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:32.6360256Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 2025-05-07T20:26:32.6360689Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:32.6361157Z 
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:32.6361523Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:32.6361766Z #define MAX_CANON 255 2025-05-07T20:26:32.6361995Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:32.6362254Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:32.6362523Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:32.6362809Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:32.6363120Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:32.6363428Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:32.6363705Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:32.6364028Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:32.6364351Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:32.6364618Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:32.6364909Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:32.6365206Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:32.6365495Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:32.6365899Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:32.6366206Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:32.6366468Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:32.6366719Z #define _SYS_TYPES_H 1 2025-05-07T20:26:32.6366961Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:32.6367229Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:32.6367478Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:32.6367719Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:32.6367996Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:32.6368291Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:32.6368545Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:32.6368866Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:32.6369139Z #define FP_SUBNORMAL 3 2025-05-07T20:26:32.6369383Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:32.6369666Z #define _INITIALIZER_LIST 2025-05-07T20:26:32.6369917Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:32.6370171Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:32.6370478Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:32.6370743Z #define _IO_file_flags _flags 2025-05-07T20:26:32.6371004Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:32.6371249Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:32.6371531Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:32.6371813Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:32.6372075Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:32.6372473Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:32.6372886Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:32.6373195Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:32.6373470Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:32.6373730Z #define _BSD_SOURCE 1 2025-05-07T20:26:32.6373958Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:32.6374867Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template> struct __has_ ##_NTYPE : false_type { }; template struct __has_ ##_NTYPE<_Tp, __void_t> : true_type { }; 2025-05-07T20:26:32.6375795Z #define __catch(X) catch(X) 2025-05-07T20:26:32.6376059Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:32.6376348Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:32.6376624Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:32.6376878Z #define __STRING(x) #x 2025-05-07T20:26:32.6377112Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:32.6377393Z #define _T_PTRDIFF_ 2025-05-07T20:26:32.6377636Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:32.6377938Z 
#define cudaEventWaitExternal 0x01 2025-05-07T20:26:32.6378216Z #define __unbounded 2025-05-07T20:26:32.6378459Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:32.6378750Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:32.6379034Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:32.6379344Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:32.6379626Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:32.6380017Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:32.6380355Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:32.6380874Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:32.6381152Z #define __managed__ __location__(managed) 2025-05-07T20:26:32.6381457Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:32.6381869Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:32.6382308Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:32.6382571Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:32.6383238Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:32.6383662Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:32.6383918Z #define _SYS_SIZE_T_H 2025-05-07T20:26:32.6384216Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:32.6384568Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:32.6384846Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:32.6385152Z #define _CRTIMP 2025-05-07T20:26:32.6385647Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:32.6385950Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:32.6386297Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:32.6386663Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:32.6387082Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:32.6387411Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:32.6387700Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:32.6387990Z #define __SIZE_T__ 2025-05-07T20:26:32.6388208Z #define __stub_gtty 2025-05-07T20:26:32.6388442Z #define __pid_t_defined 2025-05-07T20:26:32.6388711Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:32.6389006Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:32.6389329Z #define __glibcxx_function_requires(...) 
2025-05-07T20:26:32.6389632Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:32.6389971Z #define __need_clockid_t 2025-05-07T20:26:32.6390233Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:32.6390494Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:32.6390809Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:32.6391134Z #define _IO_HEX 0100 2025-05-07T20:26:32.6391393Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:32.6391730Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:32.6391834Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:32.6391935Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:32.6392162Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:32.6392284Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:32.6392389Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:32.6392495Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:32.6392598Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:32.6392698Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:32.6392785Z #define __stub_sstk 2025-05-07T20:26:32.6392876Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:32.6393042Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:32.6393130Z #define __wur 2025-05-07T20:26:32.6393247Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:32.6393334Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:32.6393422Z #define _IO_OCT 040 2025-05-07T20:26:32.6393514Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:32.6393601Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:32.6393697Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:32.6393823Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:32.6393921Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:32.6394023Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:32.6394215Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:32.6394315Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:32.6394404Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:32.6394513Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:32.6394606Z #define __off64_t_defined 2025-05-07T20:26:32.6394849Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:32.6394938Z #define __FLT128_DIG__ 33 2025-05-07T20:26:32.6395046Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:32.6395143Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:32.6395233Z #define __INT32_C(c) c 2025-05-07T20:26:32.6395327Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:32.6395424Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:32.6395523Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:32.6395615Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:32.6395701Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:32.6395803Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:32.6395934Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:32.6396028Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:32.6396122Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:32.6396218Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:32.6396312Z #define __have_pthread_attr_t 1 2025-05-07T20:26:32.6396423Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:32.6396732Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:32.6396844Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:32.6396944Z #define __cudaCDP2EventRecord 2025-05-07T20:26:32.6397037Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:32.6397128Z #define 
htole32(x) (x) 2025-05-07T20:26:32.6397381Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:32.6397501Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:32.6397607Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:32.6397764Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:32.6397902Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:32.6398036Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:32.6398175Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:32.6398273Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:32.6398381Z #define cudaArrayLayered 0x01 2025-05-07T20:26:32.6398558Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:32.6398672Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:32.6398766Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:32.6398866Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:32.6398951Z #define unix 1 2025-05-07T20:26:32.6399046Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:32.6399137Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:32.6399236Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:32.6399352Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:32.6399437Z #define __USE_POSIX 1 2025-05-07T20:26:32.6399540Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:32.6399670Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:32.6399766Z #define __THROWNL throw () 2025-05-07T20:26:32.6399857Z #define __cpp_rtti 199711L 2025-05-07T20:26:32.6399958Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:32.6400052Z #define __PMT(args) args 2025-05-07T20:26:32.6400175Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:32.6400322Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:32.6400441Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:32.6400532Z #define _SIZE_T_DECLARED 2025-05-07T20:26:32.6400629Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:32.6400729Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:32.6401150Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:32.6401259Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:32.6401352Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:32.6401449Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:32.6401596Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:32.6401679Z #define _WCHAR_T_H 2025-05-07T20:26:32.6401769Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:32.6401863Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:32.6401949Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:32.6402184Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:32.6402289Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:32.6402380Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:32.6402485Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:32.6402571Z #define __ELF__ 1 2025-05-07T20:26:32.6402671Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:32.6402778Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:32.6402863Z #define STA_INS 0x0010 2025-05-07T20:26:32.6402961Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:32.6403141Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:32.6403232Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:32.6403328Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:32.6403445Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:26:32.6403551Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:32.6403646Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:32.6403754Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:32.6403937Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:32.6404102Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:32.6404260Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:32.6404359Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:32.6404706Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:32.6404838Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:32.6404929Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:32.6405021Z #define __FLT_RADIX__ 2 2025-05-07T20:26:32.6405122Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:32.6405293Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:32.6405392Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:32.6405488Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:32.6405597Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:32.6405695Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:32.6405801Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:32.6405909Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:32.6405991Z #define WORD_BIT 32 2025-05-07T20:26:32.6406075Z #define _IO_USER_BUF 1 2025-05-07T20:26:32.6406174Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:32.6406275Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:32.6406382Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:32.6406489Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:32.6406589Z #define __long_double_t long double 2025-05-07T20:26:32.6406681Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:32.6406778Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:32.6407201Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:32.6407290Z #define __k8 1 2025-05-07T20:26:32.6407491Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:32.6407896Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:32.6408037Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:32.6408137Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:32.6408235Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:32.6408339Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:32.6408433Z #define __blksize_t_defined 2025-05-07T20:26:32.6408527Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:32.6408630Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:32.6408741Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:32.6408843Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:32.6408949Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:32.6409043Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:32.6409146Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:32.6409415Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:32.6409779Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:32.6409984Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:32.6410084Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:32.6410166Z #define SEEK_SET 0 2025-05-07T20:26:32.6410271Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:32.6410365Z #define 
__CUDA_API_VER_MINOR__ 8 2025-05-07T20:26:32.6410573Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:32.6410674Z #define __cudaCDP2GetLastError 2025-05-07T20:26:32.6410769Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:32.6410869Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:32.6411210Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:32.6411309Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:32.6411565Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:32.6411655Z #define __stub_sigreturn 2025-05-07T20:26:32.6411911Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:32.6412100Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:32.6412188Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:32.6412296Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:32.6412382Z #define CLOCK_TAI 11 2025-05-07T20:26:32.6412489Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:32.6412709Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:32.6412798Z #define __restrict_arr 2025-05-07T20:26:32.6412908Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:32.6413056Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:32.6413626Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:32.6413821Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:32.6413911Z #define __USE_MISC 1 2025-05-07T20:26:32.6414028Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:32.6414134Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:32.6414224Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:32.6414311Z #define __LDBL_DIG__ 18 2025-05-07T20:26:32.6414414Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:32.6414512Z #define __malloc_and_calloc_defined 2025-05-07T20:26:32.6414605Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:32.6414715Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:32.6414796Z #define __x86_64__ 1 2025-05-07T20:26:32.6414877Z #define _SIZE_T_ 2025-05-07T20:26:32.6415879Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:32.6415987Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:32.6416096Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:32.6416210Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:32.6416326Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:32.6416426Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:32.6416534Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:32.6416660Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:32.6416799Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:32.6416896Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:32.6417400Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy 
(__new, __old, __len); })) 2025-05-07T20:26:32.6417522Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:26:32.6417668Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:26:32.6417774Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:26:32.6417983Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:26:32.6418069Z #define STA_FLL 0x0008 2025-05-07T20:26:32.6418218Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:26:32.6418313Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:26:32.6418439Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6418548Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:26:32.6418633Z #define __stub_revoke 2025-05-07T20:26:32.6418731Z #define __timer_t_defined 1 2025-05-07T20:26:32.6418863Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:26:32.6418952Z #define INT_MAX __INT_MAX__ 2025-05-07T20:26:32.6419066Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:26:32.6419171Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:26:32.6419266Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:26:32.6419374Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:26:32.6419484Z #define cudaArrayTextureGather 0x08 2025-05-07T20:26:32.6419593Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:26:32.6419819Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:26:32.6419913Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:26:32.6420007Z #define _IO_off_t __off_t 2025-05-07T20:26:32.6420093Z #define __FLT64_DIG__ 15 2025-05-07T20:26:32.6420321Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:26:32.6420424Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:26:32.6420551Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:32.6420672Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:26:32.6420773Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:26:32.6420877Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:26:32.6420960Z #define NULL __null 2025-05-07T20:26:32.6421095Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:26:32.6421197Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:26:32.6421301Z #define __U64_TYPE unsigned long int 2025-05-07T20:26:32.6421394Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:26:32.6421497Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:26:32.6421589Z #define FP_ZERO 2 2025-05-07T20:26:32.6421684Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:26:32.6421838Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:26:32.6421951Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6422031Z #define __WCHAR_T__ 2025-05-07T20:26:32.6422126Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:26:32.6422358Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:26:32.6422511Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:26:32.6422616Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:26:32.6422734Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:26:32.6436321Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:32.6436565Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:26:32.6436734Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:26:32.6436889Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:26:32.6437022Z #define _SIGSET_H_types 1 2025-05-07T20:26:32.6437184Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:26:32.6437335Z #define __cpp_unicode_literals 200710L 2025-05-07T20:26:32.6437533Z 
#define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 2025-05-07T20:26:32.6437635Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:26:32.6437763Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:26:32.6437894Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:26:32.6438009Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:26:32.6438139Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:26:32.6438248Z #define __CUDACC_DEVICE_ATOMIC_BUILTINS__ 1 2025-05-07T20:26:32.6438430Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:26:32.6438523Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:26:32.6438626Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:26:32.6438728Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:26:32.6439003Z #define STA_MODE 0x4000 2025-05-07T20:26:32.6439114Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:26:32.6439218Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:26:32.6439333Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:26:32.6439432Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:26:32.6439532Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:26:32.6439643Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:26:32.6439749Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:26:32.6439860Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:26:32.6439946Z #define __SIZE_WIDTH__ 64 2025-05-07T20:26:32.6440399Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:32.6440483Z #define __SEG_FS 1 2025-05-07T20:26:32.6440572Z #define _IO_size_t size_t 2025-05-07T20:26:32.6440676Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:26:32.6440773Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:26:32.6440856Z #define __stub_lchmod 2025-05-07T20:26:32.6441054Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:26:32.6441162Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6441257Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:26:32.6441347Z #define __SEG_GS 1 2025-05-07T20:26:32.6441536Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:26:32.6441630Z #define _IOS_APPEND 8 2025-05-07T20:26:32.6441722Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:26:32.6441811Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:26:32.6441913Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:26:32.6442010Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:26:32.6442109Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:26:32.6442198Z #define htole16(x) (x) 2025-05-07T20:26:32.6442304Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:32.6442395Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:26:32.6442494Z #define __INT16_TYPE__ short int 2025-05-07T20:26:32.6442592Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:26:32.6442716Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:26:32.6442825Z #define __cpp_structured_bindings 201606L 2025-05-07T20:26:32.6442953Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:26:32.6443049Z #define __SIZEOF_INT__ 4 2025-05-07T20:26:32.6443138Z #define __WCLONE 0x80000000 2025-05-07T20:26:32.6443230Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:26:32.6443317Z #define SEEK_HOLE 4 2025-05-07T20:26:32.6443402Z #define TIMER_ABSTIME 1 2025-05-07T20:26:32.6443495Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:26:32.6443588Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:26:32.6443767Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:32.6443878Z #define 
__INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6444114Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:26:32.6444225Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:26:32.6444327Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:26:32.6444447Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:26:32.6444545Z #define _LINUX_LIMITS_H 2025-05-07T20:26:32.6444631Z #define linux 1 2025-05-07T20:26:32.6444721Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:26:32.6444829Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:26:32.6444936Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:26:32.6445026Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:26:32.6445131Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:26:32.6445281Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:26:32.6445377Z #define __cpp_lib_hypot 201603 2025-05-07T20:26:32.6445470Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:26:32.6445571Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:26:32.6445658Z #define MOD_NANO ADJ_NANO 2025-05-07T20:26:32.6445747Z #define htole64(x) (x) 2025-05-07T20:26:32.6445846Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:26:32.6445978Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:26:32.6446115Z #define _IO_UPPERCASE 01000 2025-05-07T20:26:32.6446827Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:26:32.6446923Z #define __USE_POSIX2 1 2025-05-07T20:26:32.6447031Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:26:32.6447116Z #define __WALL 0x40000000 2025-05-07T20:26:32.6447212Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:26:32.6447309Z #define _XLOCALE_H 1 2025-05-07T20:26:32.6447442Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:26:32.6447579Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:26:32.6447675Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:26:32.6447779Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:26:32.6447873Z #define __EXCEPTIONS 1 2025-05-07T20:26:32.6447974Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:26:32.6448174Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:26:32.6448267Z #define __WORDSIZE 64 2025-05-07T20:26:32.6448359Z #define CLOCK_MONOTONIC 1 2025-05-07T20:26:32.6448448Z #define _STL_RELOPS_H 1 2025-05-07T20:26:32.6448677Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:26:32.6448776Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:26:32.6448876Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:26:32.6448974Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:26:32.6449072Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:26:32.6449390Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:26:32.6449628Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:32.6449765Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:26:32.6449871Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:26:32.6449971Z #define __cpp_range_based_for 201603L 2025-05-07T20:26:32.6450081Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:26:32.6450190Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:26:32.6450299Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:26:32.6450483Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:26:32.6450595Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:26:32.6450687Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:26:32.6450796Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:26:32.6450976Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:32.6451091Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:26:32.6451181Z #define _STRING_H 1 2025-05-07T20:26:32.6451280Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:26:32.6451370Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:26:32.6451473Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:26:32.6451606Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:26:32.6451701Z #define __code_model_small__ 1 2025-05-07T20:26:32.6451793Z #define _PSTL_CONFIG_H 2025-05-07T20:26:32.6451893Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:26:32.6452009Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:26:32.6452104Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:26:32.6452211Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:26:32.6452574Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:32.6452668Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:26:32.6452754Z #define le64toh(x) (x) 2025-05-07T20:26:32.6452851Z #define FILENAME_MAX 4096 2025-05-07T20:26:32.6453004Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:26:32.6453119Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:26:32.6453211Z #define L_cuserid 9 2025-05-07T20:26:32.6453298Z #define __ino_t_defined 2025-05-07T20:26:32.6453379Z #define __k8__ 1 2025-05-07T20:26:32.6453482Z #define __INTPTR_TYPE__ long int 2025-05-07T20:26:32.6453591Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:26:32.6453683Z #define __int8_t_defined 2025-05-07T20:26:32.6453773Z #define __WCHAR_TYPE__ int 2025-05-07T20:26:32.6453871Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:26:32.6453989Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:26:32.6454177Z #define __SLONGWORD_TYPE long int 2025-05-07T20:26:32.6454297Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:26:32.6454454Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:26:32.6454538Z #define __HAVE_COLUMN 2025-05-07T20:26:32.6454622Z #define __stub_fdetach 2025-05-07T20:26:32.6455064Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:26:32.6455144Z #define __pic__ 2 2025-05-07T20:26:32.6455272Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:32.6455368Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:26:32.6455462Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:26:32.6455570Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:26:32.6455654Z #define __stub_chflags 2025-05-07T20:26:32.6455743Z #define CLOCK_BOOTTIME 7 2025-05-07T20:26:32.6455836Z #define __need_IOV_MAX 2025-05-07T20:26:32.6456026Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:26:32.6456132Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:26:32.6456236Z #define __cpp_decltype 200707L 2025-05-07T20:26:32.6456332Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:26:32.6456423Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:26:32.6456535Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:26:32.6456621Z #define TTY_NAME_MAX 32 2025-05-07T20:26:32.6456796Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:26:32.6456917Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6457086Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:26:32.6457199Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:26:32.6457291Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:26:32.6457384Z #define STA_PPSTIME 0x0004 2025-05-07T20:26:32.6457473Z #define __import__ 2025-05-07T20:26:32.6457563Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:26:32.6457703Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:26:32.6457799Z #define __export__ 2025-05-07T20:26:32.6457918Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:26:32.6458025Z #define cudaMemAttachHost 0x02 2025-05-07T20:26:32.6458190Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:32.6458285Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:26:32.6458376Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:26:32.6458470Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:26:32.6458560Z #define _WCHAR_T_DECLARED 2025-05-07T20:26:32.6458687Z #define 
__UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:26:32.6458804Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:26:32.6458908Z #define __cpp_inline_variables 201606L 2025-05-07T20:26:32.6459004Z #define WNOWAIT 0x01000000 2025-05-07T20:26:32.6459087Z #define PLOSS 6 2025-05-07T20:26:32.6459181Z #define M_LN10 2.30258509299404568402 2025-05-07T20:26:32.6459471Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:26:32.6459568Z #define EXIT_SUCCESS 0 2025-05-07T20:26:32.6459671Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:26:32.6459766Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:26:32.6459868Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:26:32.6459963Z #define __thread__ __thread 2025-05-07T20:26:32.6460057Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:26:32.6460150Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:26:32.6460258Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:26:32.6460491Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:32.6460600Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:26:32.6460701Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:26:32.6460784Z #define __linux__ 1 2025-05-07T20:26:32.6460886Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:26:32.6461013Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:26:32.6461108Z #define __S16_TYPE short int 2025-05-07T20:26:32.6461578Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:26:32.6461688Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:26:32.6461882Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:26:32.6461988Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:26:32.6462086Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:26:32.6462167Z #define _T_SIZE_ 2025-05-07T20:26:32.6462270Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:32.6462388Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:26:32.6462488Z #define _PSTL_VERSION 12000 2025-05-07T20:26:32.6462608Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:26:32.6462702Z #define __WNOTHREAD 0x20000000 2025-05-07T20:26:32.6462804Z #define _G_va_list __gnuc_va_list 2025-05-07T20:26:32.6462933Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:26:32.6463018Z #define _IOS_INPUT 1 2025-05-07T20:26:32.6463114Z #define __USE_LARGEFILE64 1 2025-05-07T20:26:32.6463302Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:26:32.6463394Z #define __INT64_TYPE__ long int 2025-05-07T20:26:32.6463499Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:26:32.6463598Z #define __shared__ __location__(shared) 2025-05-07T20:26:32.6463688Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:26:32.6463850Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:26:32.6463936Z #define __gid_t_defined 2025-05-07T20:26:32.6464053Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:26:32.6464148Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:26:32.6464350Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:26:32.6464450Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:26:32.6464540Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:26:32.6464627Z #define ___int_size_t_h 2025-05-07T20:26:32.6464739Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:32.6464860Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:26:32.6465026Z 
#define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:26:32.6465130Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:26:32.6465224Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:26:32.6465329Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:26:32.6465420Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:26:32.6465544Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6465661Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:26:32.6465778Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:26:32.6465870Z #define __clock_t_defined 1 2025-05-07T20:26:32.6465974Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:26:32.6466081Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:26:32.6466169Z #define __GLIBC_MINOR__ 17 2025-05-07T20:26:32.6466266Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:26:32.6466362Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:26:32.6466473Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:26:32.6466578Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:26:32.6466782Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:32.6466880Z #define __SSE__ 1 2025-05-07T20:26:32.6466976Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:26:32.6467068Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:26:32.6467157Z #define _CTYPE_H 1 2025-05-07T20:26:32.6467249Z #define __sigset_t_defined 2025-05-07T20:26:32.6467343Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:26:32.6467445Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:26:32.6467531Z #define MOD_TAI ADJ_TAI 2025-05-07T20:26:32.6467627Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:26:32.6467724Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:26:32.6467806Z #define __SM_70_RT_H__ 2025-05-07T20:26:32.6467904Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:26:32.6468006Z #define cudaEventWaitDefault 0x00 2025-05-07T20:26:32.6468101Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:26:32.6468354Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:32.6468454Z #define _POSIX_MAX_CANON 255 2025-05-07T20:26:32.6468564Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:26:32.6468662Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:26:32.6468753Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:26:32.6468835Z #define __amd64__ 1 2025-05-07T20:26:32.6468926Z #define __WINT_WIDTH__ 32 2025-05-07T20:26:32.6469028Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:26:32.6469306Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:32.6469410Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:26:32.6469489Z #define EOF (-1) 2025-05-07T20:26:32.6469594Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:26:32.6469828Z #define __USE_POSIX199309 1 2025-05-07T20:26:32.6469927Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:26:32.6470024Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:26:32.6470116Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:26:32.6470211Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:26:32.6470424Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:26:32.6470516Z #define ____mbstate_t_defined 1 2025-05-07T20:26:32.6470601Z #define STA_NANO 0x2000 2025-05-07T20:26:32.6470701Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:26:32.6470792Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:26:32.6470877Z #define _IO_LINKED 0x80 2025-05-07T20:26:32.6470979Z #define __cpp_lib_launder 201606 2025-05-07T20:26:32.6471069Z #define __SIZEOF_INT128__ 16 2025-05-07T20:26:32.6471177Z 
#define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:26:32.6471268Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:26:32.6471362Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:26:32.6471510Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:26:32.6471614Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:32.6471713Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:32.6471822Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:26:32.6471914Z #define __W_CONTINUED 0xffff 2025-05-07T20:26:32.6472013Z #define __ATOMIC_RELAXED 0 2025-05-07T20:26:32.6472149Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:26:32.6472269Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:32.6472480Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:26:32.6472671Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:26:32.6472755Z #define __stub_stty 2025-05-07T20:26:32.6472932Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:26:32.6473016Z #define le16toh(x) (x) 2025-05-07T20:26:32.6473122Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:26:32.6473309Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:26:32.6473390Z #define _SIZET_ 2025-05-07T20:26:32.6473480Z #define XATTR_NAME_MAX 255 2025-05-07T20:26:32.6473566Z #define _SVID_SOURCE 1 2025-05-07T20:26:32.6473646Z #define _LP64 1 2025-05-07T20:26:32.6473734Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:26:32.6473994Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:26:32.6474104Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:26:32.6474193Z #define __UINT8_C(c) c 2025-05-07T20:26:32.6474285Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:26:32.6474376Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:26:32.6474488Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:26:32.6474580Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:26:32.6474673Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:26:32.6474773Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:26:32.6474855Z #define CUDARTAPI 2025-05-07T20:26:32.6474935Z #define IOV_MAX 1024 2025-05-07T20:26:32.6475082Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:26:32.6475176Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:26:32.6475281Z #define P_tmpdir "/tmp" 2025-05-07T20:26:32.6475382Z #define cudaMemAttachSingle 0x04 2025-05-07T20:26:32.6475463Z #define __wchar_t__ 2025-05-07T20:26:32.6475689Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:26:32.6475772Z #define SEEK_END 2 2025-05-07T20:26:32.6475861Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:26:32.6476041Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:26:32.6476137Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:26:32.6476281Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:26:32.6476375Z #define ____FILE_defined 1 2025-05-07T20:26:32.6476488Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:26:32.6476586Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:26:32.6476950Z #define _ISOC99_SOURCE 1 2025-05-07T20:26:32.6477047Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:26:32.6477470Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:32.6477600Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:26:32.6477681Z #define _IO_RIGHT 04 2025-05-07T20:26:32.6477782Z #define __END_NAMESPACE_STD 2025-05-07T20:26:32.6478072Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:32.6478164Z #define _GLIBCXX_STD_C std 2025-05-07T20:26:32.6478285Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:26:32.6478378Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:26:32.6478476Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:26:32.6478570Z #define _STDDEF_H_ 2025-05-07T20:26:32.6478748Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:32.6478849Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:26:32.6478966Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:26:32.6479170Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:26:32.6479283Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6479422Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:26:32.6479540Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:26:32.6479645Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:26:32.6479762Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:26:32.6479857Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:26:32.6479977Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:26:32.6480071Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:26:32.6480165Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:26:32.6480260Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:26:32.6480439Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:26:32.6480535Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:26:32.6480718Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:26:32.6480815Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:26:32.6480913Z #define __STDCPP_THREADS__ 1 2025-05-07T20:26:32.6481056Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:26:32.6481150Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:26:32.6481250Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:26:32.6481347Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:26:32.6481475Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:26:32.6481572Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:26:32.6481671Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:26:32.6481861Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:26:32.6482036Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:26:32.6482136Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:26:32.6482264Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:26:32.6482373Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:26:32.6482473Z #define __location__(a) __annotate__(a) 2025-05-07T20:26:32.6482722Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:26:32.6483132Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:26:32.6483285Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:26:32.6483389Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:26:32.6483478Z #define __STDC_UTF_32__ 1 2025-05-07T20:26:32.6483819Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:26:32.6483918Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:26:32.6484012Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:26:32.6484096Z #define __FXSR__ 1 2025-05-07T20:26:32.6484174Z #define _SIZE_T 2025-05-07T20:26:32.6484277Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:26:32.6484395Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:26:32.6484568Z #define __FLT32X_MAX__ 
1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:32.6484721Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:26:32.6484821Z #define _IO_ssize_t __ssize_t 2025-05-07T20:26:32.6484917Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:26:32.6485112Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:32.6485318Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:26:32.6485407Z #define _GXX_NULLPTR_T 2025-05-07T20:26:32.6485537Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:26:32.6485758Z #define FOPEN_MAX 16 2025-05-07T20:26:32.6485846Z #define __BIG_ENDIAN 4321 2025-05-07T20:26:32.6485973Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:32.6486069Z #define __suseconds_t_defined 2025-05-07T20:26:32.6486154Z #define __off_t_defined 2025-05-07T20:26:32.6486245Z #define stderr stderr 2025-05-07T20:26:32.6486341Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:26:32.6486452Z #define __glibcxx_requires_string(_String) 2025-05-07T20:26:32.6486562Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:26:32.6486656Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:26:32.6487099Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:26:32.6487188Z #define __mode_t_defined 2025-05-07T20:26:32.6487269Z #define _GCC_SIZE_T 2025-05-07T20:26:32.6487370Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:32.6487471Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:26:32.6487592Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:26:32.6487690Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:26:32.6487844Z #define __UINT32_C(c) c ## U 2025-05-07T20:26:32.6487978Z #define __cpp_alias_templates 200704L 2025-05-07T20:26:32.6488151Z #define cudaHostAllocMapped 0x02 2025-05-07T20:26:32.6488315Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:26:32.6488642Z #define _STL_ITERATOR_H 1 2025-05-07T20:26:32.6488910Z #define __size_t__ 2025-05-07T20:26:32.6489073Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:26:32.6489201Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:26:32.6489394Z #define cudaEventRecordExternal 0x01 2025-05-07T20:26:32.6489562Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:26:32.6489804Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:26:32.6490008Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:26:32.6490122Z #define _ENDIAN_H 1 2025-05-07T20:26:32.6490303Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:26:32.6490450Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:26:32.6490567Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:26:32.6490775Z #define __try try 2025-05-07T20:26:32.6490901Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:26:32.6491026Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:26:32.6491203Z #define __INT8_MAX__ 0x7f 2025-05-07T20:26:32.6491506Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:26:32.6491692Z #define __LONG_WIDTH__ 64 2025-05-07T20:26:32.6491819Z #define __PIC__ 2 2025-05-07T20:26:32.6491960Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:26:32.6492145Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:26:32.6492330Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:26:32.6492455Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:26:32.6492644Z #define 
2025-05-07T20:26:32.6493049Z [preprocessor #define dump: several thousand macro definitions from the host toolchain, glibc/libstdc++, and the CUDA headers, kept here only as a placeholder; notable entries include __NVCC__ 1, __CUDACC__ 1, CUDART_VERSION 12080, __GNUC_MINOR__ 4, and __STDC__ 1]
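A dump like the one above comes from asking the preprocessor to print every macro it predefines. A minimal sketch of the technique (the CI script's exact invocation is not shown in this log and may differ):

  # Dump all predefined macros of the host compiler, sorted for inspection
  gcc -dM -E - < /dev/null | sort
  # Filter for a single macro, e.g. the GNU C minor version seen in the dump
  gcc -dM -E - < /dev/null | grep __GNUC_MINOR__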
2025-05-07T20:26:32.6724852Z + conda run -n build_binary nvcc --version
2025-05-07T20:26:34.5613766Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:26:34.5614431Z Copyright (c) 2005-2025 NVIDIA Corporation
2025-05-07T20:26:34.5614920Z Built on Wed_Jan_15_19:20:09_PST_2025
2025-05-07T20:26:34.5615369Z Cuda compilation tools, release 12.8, V12.8.61
2025-05-07T20:26:34.5615785Z Build cuda_12.8.r12.8/compiler.35404655_0
2025-05-07T20:26:34.6270313Z /usr/bin/nvidia-smi
2025-05-07T20:26:34.6273901Z + nvidia-smi
2025-05-07T20:26:34.6445858Z Wed May 7 20:26:34 2025
2025-05-07T20:26:34.6446615Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:34.6447237Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:26:34.6447827Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:34.6448482Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:26:34.6449177Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:26:34.6449727Z | | | MIG M. |
2025-05-07T20:26:34.6450294Z |=========================================+========================+======================|
2025-05-07T20:26:34.6613657Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:26:34.6615088Z | 0% 26C P8 19W / 300W | 0MiB / 23028MiB | 0% Default |
2025-05-07T20:26:34.6616290Z | | | N/A |
2025-05-07T20:26:34.6617158Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:34.6618931Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:34.6619487Z | Processes: |
2025-05-07T20:26:34.6619987Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:26:34.6620597Z | ID ID Usage |
2025-05-07T20:26:34.6621043Z |=========================================================================================|
2025-05-07T20:26:34.6634384Z | No running processes found |
2025-05-07T20:26:34.6635261Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:34.9402626Z [INSTALL] Successfully installed CUDA 12.8.0
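With ENFORCE_CUDA_DEVICE=1 set for this job, the toolkit check above must find a working GPU. The same assertion can be scripted outside CI; a minimal sketch (the query fields are illustrative):

  # Prints one CSV row per visible device; exits non-zero if the driver is unreachable
  nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv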
2025-05-07T20:26:34.9452094Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:26:34.9452657Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:26:34.9464414Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:26:34.9464768Z env:
2025-05-07T20:26:34.9464984Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:26:34.9465280Z BUILD_ENV: build_binary
2025-05-07T20:26:34.9465523Z BUILD_TARGET: genai
2025-05-07T20:26:34.9465745Z BUILD_VARIANT: cuda
2025-05-07T20:26:34.9465968Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:26:34.9466221Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:26:34.9466520Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:26:34.9466854Z ##[endgroup]
2025-05-07T20:26:35.2825519Z ################################################################################
2025-05-07T20:26:35.2826054Z # Install PyTorch (PIP)
2025-05-07T20:26:35.2826373Z #
2025-05-07T20:26:35.2841088Z # [2025-05-07T20:26:35.283Z] + install_pytorch_pip build_binary nightly cuda/12.8.0
2025-05-07T20:26:35.2841745Z ################################################################################
2025-05-07T20:26:35.2871738Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:26:36.2827473Z Channels:
2025-05-07T20:26:36.2827780Z - conda-forge
2025-05-07T20:26:36.2828087Z Platform: linux-64
2025-05-07T20:26:39.5817421Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:40.3057357Z Solving environment: done
2025-05-07T20:26:40.5244282Z ## Package Plan ##
2025-05-07T20:26:40.5244642Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:40.5245117Z added / updated specs:
2025-05-07T20:26:40.5245473Z - numpy
2025-05-07T20:26:40.5245809Z The following packages will be downloaded:
2025-05-07T20:26:40.5246246Z     package                    |            build
2025-05-07T20:26:40.5246671Z     ---------------------------|-----------------
2025-05-07T20:26:40.5247072Z     libblas-3.9.0              | 31_h59b9bed_openblas      16 KB  conda-forge
2025-05-07T20:26:40.5247544Z     libcblas-3.9.0             | 31_he106b2a_openblas      16 KB  conda-forge
2025-05-07T20:26:40.5248011Z     libgfortran-15.1.0         | h69a702a_2                34 KB  conda-forge
2025-05-07T20:26:40.5248482Z     libgfortran5-15.1.0        | hcea5267_2               1.5 MB  conda-forge
2025-05-07T20:26:40.5248957Z     liblapack-3.9.0            | 31_h7ac8fdf_openblas      16 KB  conda-forge
2025-05-07T20:26:40.5249451Z     libopenblas-0.3.29         | pthreads_h94d23a6_0      5.6 MB  conda-forge
2025-05-07T20:26:40.5249919Z     numpy-2.0.2                | py39h9cb892a_1           7.6 MB  conda-forge
2025-05-07T20:26:40.5250314Z     ------------------------------------------------------------
2025-05-07T20:26:40.5250665Z                                            Total:        14.8 MB
2025-05-07T20:26:40.5251020Z The following NEW packages will be INSTALLED:
2025-05-07T20:26:40.5251474Z   libblas       conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:26:40.5251992Z   libcblas      conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:26:40.5252517Z   libgfortran   conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:26:40.5253045Z   libgfortran5  conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:26:40.5253587Z   liblapack     conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:26:40.5254151Z   libopenblas   conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:26:40.5254946Z   numpy         conda-forge/linux-64::numpy-2.0.2-py39h9cb892a_1
2025-05-07T20:26:40.5255574Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:40.6746410Z libblas-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:26:40.6892893Z libcblas-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:26:40.9079995Z libgfortran-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:26:40.9722395Z libgfortran5-15.1.0 | 1.5 MB | ########## | 100%
2025-05-07T20:26:40.9502878Z liblapack-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:26:41.0050144Z libopenblas-0.3.29 | 5.6 MB | ########## | 100%
2025-05-07T20:26:41.0490700Z numpy-2.0.2 | 7.6 MB | ########## | 100%
2025-05-07T20:26:41.4724265Z done
2025-05-07T20:26:41.5726661Z Preparing transaction: done
2025-05-07T20:26:41.7736651Z Verifying transaction: done
2025-05-07T20:26:41.8746775Z Executing transaction: done
2025-05-07T20:26:42.0540480Z ################################################################################
2025-05-07T20:26:42.0541014Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:42.0541378Z #
2025-05-07T20:26:42.0559038Z # [2025-05-07T20:26:42.055Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.8.0
2025-05-07T20:26:42.0559694Z ################################################################################
2025-05-07T20:26:42.0576218Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:42.1471067Z [CHECK] Network does not appear to be blocked.
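The [EXEC] [ATTEMPT 0/3] prefix comes from a bounded-retry wrapper that the prelude script puts around network-dependent commands. A minimal bash sketch of the pattern (the real helper lives in .github/scripts/setup_env.bash; the function name and backoff below are assumptions):

  # Hypothetical retry wrapper mirroring the [EXEC] [ATTEMPT n/3] log lines
  exec_with_retries() {
    local max_attempts=3
    local i
    for ((i = 0; i <= max_attempts; i++)); do
      echo "[EXEC] [ATTEMPT ${i}/${max_attempts}] + $*"
      "$@" && return 0
      sleep 2  # assumed fixed backoff; the actual delay is not visible in the log
    done
    return 1
  }

  # Usage, matching the network probe above
  exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null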
2025-05-07T20:26:42.1471576Z ################################################################################ 2025-05-07T20:26:42.1472287Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:26:42.1472591Z # 2025-05-07T20:26:42.1489948Z # [2025-05-07T20:26:42.148Z] + __prepare_pip_arguments torch nightly cuda/12.8.0 2025-05-07T20:26:42.1490687Z ################################################################################ 2025-05-07T20:26:42.1490923Z 2025-05-07T20:26:42.1513429Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:26:42.1537627Z [INSTALL] Extracted package variant: cu128 2025-05-07T20:26:42.1553700Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:26:42.1554255Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:26:42.1562719Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:26:42.1571484Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu128/ ... 2025-05-07T20:26:42.1592982Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:18.4533908Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:18.4534563Z Collecting torch 2025-05-07T20:28:18.4535301Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp39-cp39-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:28:18.4536325Z Collecting filelock (from torch) 2025-05-07T20:28:18.4537029Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:28:18.4538206Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from torch) (4.13.2) 2025-05-07T20:28:18.4539013Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:28:18.4539535Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:28:18.4540693Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 34.5 MB/s eta 0:00:00 2025-05-07T20:28:18.4541174Z Collecting networkx (from torch) 2025-05-07T20:28:18.4541696Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.2.1-py3-none-any.whl (1.6 MB) 2025-05-07T20:28:18.4542388Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 16.3 MB/s eta 0:00:00 2025-05-07T20:28:18.4542740Z Collecting jinja2 (from torch) 2025-05-07T20:28:18.4543240Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:28:18.4543778Z Collecting fsspec (from torch) 2025-05-07T20:28:18.4544290Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:28:18.4544895Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch) 2025-05-07T20:28:18.4545779Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:18.4546686Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch) 2025-05-07T20:28:18.4547575Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:18.4548477Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch) 2025-05-07T20:28:18.4549352Z Downloading 
https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:18.4550393Z Collecting nvidia-cudnn-cu12==9.8.0.87 (from torch) 2025-05-07T20:28:18.4551136Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB) 2025-05-07T20:28:18.4551893Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch) 2025-05-07T20:28:18.4552979Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:18.4553752Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch) 2025-05-07T20:28:18.4554591Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:18.4555647Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch) 2025-05-07T20:28:18.4556407Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:18.4557182Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch) 2025-05-07T20:28:18.4557950Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:18.4558727Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch) 2025-05-07T20:28:18.4559599Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:18.4560463Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:28:18.4561318Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB) 2025-05-07T20:28:18.4562081Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:28:18.4562911Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:28:18.4563734Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch) 2025-05-07T20:28:18.4564561Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:18.4565412Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch) 2025-05-07T20:28:18.4566282Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:18.4567164Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch) 2025-05-07T20:28:18.4568008Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:18.4568881Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:28:18.4569766Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:18.4571144Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1) 
2025-05-07T20:28:18.4572061Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:28:18.4572647Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:28:18.4573340Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 5.0 MB/s eta 0:00:00 2025-05-07T20:28:18.4573729Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:28:18.4574465Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB) 2025-05-07T20:28:18.4575587Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp39-cp39-manylinux_2_28_x86_64.whl (1047.1 MB) 2025-05-07T20:28:18.4576429Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 GB 21.8 MB/s eta 0:00:00 2025-05-07T20:28:18.4577159Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB) 2025-05-07T20:28:18.4577992Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 609.6/609.6 MB 51.0 MB/s eta 0:00:00 2025-05-07T20:28:18.4578934Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB) 2025-05-07T20:28:18.4579851Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 169.1 MB/s eta 0:00:00 2025-05-07T20:28:18.4580757Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB) 2025-05-07T20:28:18.4581668Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.0/88.0 MB 171.0 MB/s eta 0:00:00 2025-05-07T20:28:18.4582520Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB) 2025-05-07T20:28:18.4583745Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 954.8/954.8 kB 72.6 MB/s eta 0:00:00 2025-05-07T20:28:18.4584468Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB) 2025-05-07T20:28:18.4585298Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 698.0/698.0 MB 41.1 MB/s eta 0:00:00 2025-05-07T20:28:18.4586109Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB) 2025-05-07T20:28:18.4587022Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.1/193.1 MB 137.5 MB/s eta 0:00:00 2025-05-07T20:28:18.4587832Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB) 2025-05-07T20:28:18.4588722Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 102.3 MB/s eta 0:00:00 2025-05-07T20:28:18.4589452Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 MB) 2025-05-07T20:28:18.4590353Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.6/63.6 MB 146.7 MB/s eta 0:00:00 2025-05-07T20:28:18.4591103Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB) 2025-05-07T20:28:18.4591926Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 260.4/260.4 MB 127.8 MB/s eta 0:00:00 2025-05-07T20:28:18.4592772Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB) 
2025-05-07T20:28:18.4593688Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 292.1/292.1 MB 109.1 MB/s eta 0:00:00
2025-05-07T20:28:18.4594421Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB)
2025-05-07T20:28:18.4595247Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 136.5 MB/s eta 0:00:00
2025-05-07T20:28:18.4596046Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
2025-05-07T20:28:18.4597095Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 130.5 MB/s eta 0:00:00
2025-05-07T20:28:18.4597921Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 MB)
2025-05-07T20:28:18.4598953Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.2/39.2 MB 161.8 MB/s eta 0:00:00
2025-05-07T20:28:18.4599742Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB)
2025-05-07T20:28:18.4600976Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.4 MB)
2025-05-07T20:28:18.4601894Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.4/153.4 MB 129.3 MB/s eta 0:00:00
2025-05-07T20:28:18.4603812Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
2025-05-07T20:28:18.4607764Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.2.1 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128
2025-05-07T20:28:20.6807990Z torch 2.8.0.dev20250507+cu128
2025-05-07T20:28:20.6809945Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu128)
2025-05-07T20:28:24.1280971Z [CHECK] Python (sub-)package 'torch.distributed' found ...
2025-05-07T20:28:27.5710726Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu128
2025-05-07T20:28:27.5711349Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ...
2025-05-07T20:28:30.9468982Z True
2025-05-07T20:28:30.9469228Z True
2025-05-07T20:28:31.0102249Z [INSTALL] Successfully installed PyTorch through PyTorch PIP
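The variant and ABI checks above amount to querying the installed wheel from Python; a minimal sketch of an equivalent local verification (env name taken from BUILD_ENV):

  # Print version, CUDA variant, and C++11 ABI flag of the installed torch
  conda run -n build_binary python -c "import torch; print(torch.__version__, torch.version.cuda, torch._C._GLIBCXX_USE_CXX11_ABI)"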
2025-05-07T20:28:31.0153989Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:31.0154618Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:31.0169109Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:28:31.0169472Z env:
2025-05-07T20:28:31.0169695Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:28:31.0170003Z BUILD_ENV: build_binary
2025-05-07T20:28:31.0170245Z BUILD_TARGET: genai
2025-05-07T20:28:31.0170478Z BUILD_VARIANT: cuda
2025-05-07T20:28:31.0170710Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:28:31.0170964Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:28:31.0171285Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:28:31.0171631Z ##[endgroup]
2025-05-07T20:28:31.3519343Z /home/ec2-user/miniconda/bin/conda
2025-05-07T20:28:31.3521106Z ################################################################################
2025-05-07T20:28:31.3521609Z # Collect PyTorch Environment Information (for Reporting Issues)
2025-05-07T20:28:31.3521996Z #
2025-05-07T20:28:31.3538426Z # [2025-05-07T20:28:31.353Z] + collect_pytorch_env_info build_binary
2025-05-07T20:28:31.3538830Z ################################################################################
2025-05-07T20:28:31.3554385Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:28:31.4592208Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:28:31.4603270Z [INFO] Downloading the PyTorch environment info collection script ...
2025-05-07T20:28:31.4603924Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
2025-05-07T20:28:31.5501693Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ...
2025-05-07T20:28:31.5524876Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py
2025-05-07T20:28:37.9066761Z Collecting environment information...
2025-05-07T20:28:37.9067177Z PyTorch version: 2.8.0.dev20250507+cu128 2025-05-07T20:28:37.9067489Z Is debug build: False 2025-05-07T20:28:37.9067739Z CUDA used to build PyTorch: 12.8 2025-05-07T20:28:37.9068028Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:37.9068202Z 2025-05-07T20:28:37.9068308Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:37.9068625Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:37.9068949Z Clang version: Could not collect 2025-05-07T20:28:37.9069225Z CMake version: Could not collect 2025-05-07T20:28:37.9069501Z Libc version: glibc-2.34 2025-05-07T20:28:37.9069674Z 2025-05-07T20:28:37.9070150Z Python version: 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:37.9070802Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:37.9071224Z Is CUDA available: True 2025-05-07T20:28:37.9071465Z CUDA runtime version: 12.8.61 2025-05-07T20:28:37.9071736Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:37.9072046Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:37.9072371Z Nvidia driver version: 570.133.07 2025-05-07T20:28:37.9072650Z cuDNN version: Could not collect 2025-05-07T20:28:37.9072918Z HIP runtime version: N/A 2025-05-07T20:28:37.9073159Z MIOpen runtime version: N/A 2025-05-07T20:28:37.9073466Z Is XNNPACK available: True 2025-05-07T20:28:37.9073634Z 2025-05-07T20:28:37.9073710Z CPU: 2025-05-07T20:28:37.9073932Z Architecture: x86_64 2025-05-07T20:28:37.9074263Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:37.9074672Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:37.9075073Z Byte Order: Little Endian 2025-05-07T20:28:37.9075743Z CPU(s): 16 2025-05-07T20:28:37.9076046Z On-line CPU(s) list: 0-15 2025-05-07T20:28:37.9076650Z Vendor ID: AuthenticAMD 2025-05-07T20:28:37.9077017Z Model name: AMD EPYC 7R32 2025-05-07T20:28:37.9077342Z CPU family: 23 2025-05-07T20:28:37.9077622Z Model: 49 2025-05-07T20:28:37.9077915Z Thread(s) per core: 2 2025-05-07T20:28:37.9078215Z Core(s) per socket: 8 2025-05-07T20:28:37.9078497Z Socket(s): 1 2025-05-07T20:28:37.9078776Z Stepping: 0 2025-05-07T20:28:37.9079091Z BogoMIPS: 5600.00 2025-05-07T20:28:37.9081359Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:37.9083909Z Hypervisor vendor: KVM 2025-05-07T20:28:37.9084220Z Virtualization type: full 2025-05-07T20:28:37.9084571Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:37.9084949Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:37.9085319Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:37.9085675Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:37.9086180Z NUMA node(s): 1 2025-05-07T20:28:37.9086480Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:37.9086813Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:37.9087201Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:37.9087570Z Vulnerability L1tf: Not affected 2025-05-07T20:28:37.9087916Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:37.9088277Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:37.9088638Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:37.9089008Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:37.9089570Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:37.9090181Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:37.9090743Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:37.9091466Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:37.9092376Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:37.9093092Z Vulnerability Srbds: Not affected 2025-05-07T20:28:37.9093461Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:37.9093700Z 2025-05-07T20:28:37.9093800Z Versions of relevant libraries: 2025-05-07T20:28:37.9094068Z [pip3] numpy==2.0.2 2025-05-07T20:28:37.9094309Z [pip3] nvidia-cublas-cu12==12.8.3.14 2025-05-07T20:28:37.9094613Z [pip3] nvidia-cuda-cupti-cu12==12.8.57 2025-05-07T20:28:37.9094924Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 2025-05-07T20:28:37.9095238Z [pip3] nvidia-cuda-runtime-cu12==12.8.57 2025-05-07T20:28:37.9095555Z [pip3] nvidia-cudnn-cu12==9.8.0.87 2025-05-07T20:28:37.9095847Z [pip3] nvidia-cufft-cu12==11.3.3.41 2025-05-07T20:28:37.9096140Z [pip3] nvidia-curand-cu12==10.3.9.55 2025-05-07T20:28:37.9096442Z [pip3] nvidia-cusolver-cu12==11.7.2.55 2025-05-07T20:28:37.9096742Z [pip3] nvidia-cusparse-cu12==12.5.7.53 2025-05-07T20:28:37.9097198Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:37.9097500Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:37.9097783Z [pip3] nvidia-nvjitlink-cu12==12.8.61 2025-05-07T20:28:37.9098085Z [pip3] nvidia-nvtx-cu12==12.8.55 2025-05-07T20:28:37.9098372Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:37.9098672Z [pip3] torch==2.8.0.dev20250507+cu128 2025-05-07T20:28:37.9099059Z [conda] cuda-cudart 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:37.9099568Z [conda] cuda-cudart-dev 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:37.9100107Z [conda] cuda-cudart-dev_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:37.9100647Z [conda] cuda-cudart-static 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:37.9101211Z [conda] cuda-cudart-static_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:37.9101775Z [conda] cuda-cudart_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:37.9102275Z [conda] cuda-cupti 12.8.57 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9102845Z [conda] cuda-cupti-dev 12.8.57 h5888daf_0 conda-forge 2025-05-07T20:28:37.9103354Z [conda] cuda-libraries 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:37.9103874Z [conda] cuda-libraries-dev 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:37.9104364Z [conda] cuda-nvrtc 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9104855Z [conda] cuda-nvrtc-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:37.9105394Z [conda] cuda-nvtx 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9105979Z [conda] cuda-opencl 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9106472Z [conda] cuda-opencl-dev 12.8.55 h5888daf_0 conda-forge 2025-05-07T20:28:37.9106975Z [conda] cuda-runtime 12.8.0 ha804496_0 conda-forge 2025-05-07T20:28:37.9107456Z [conda] libcublas 12.8.3.14 h9ab20c4_0 conda-forge 
2025-05-07T20:28:37.9107942Z [conda] libcublas-dev 12.8.3.14 h9ab20c4_0 conda-forge 2025-05-07T20:28:37.9108426Z [conda] libcufft 11.3.3.41 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9108901Z [conda] libcufft-dev 11.3.3.41 h5888daf_0 conda-forge 2025-05-07T20:28:37.9109384Z [conda] libcurand 10.3.9.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9109986Z [conda] libcurand-dev 10.3.9.55 h5888daf_0 conda-forge 2025-05-07T20:28:37.9110486Z [conda] libcusolver 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:37.9110988Z [conda] libcusolver-dev 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:37.9111491Z [conda] libcusparse 12.5.7.53 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9111993Z [conda] libcusparse-dev 12.5.7.53 h5888daf_0 conda-forge 2025-05-07T20:28:37.9112496Z [conda] libnvjitlink 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9113001Z [conda] libnvjitlink-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:37.9113477Z [conda] numpy 2.0.2 py39h9cb892a_1 conda-forge 2025-05-07T20:28:37.9113958Z [conda] nvidia-cublas-cu12 12.8.3.14 pypi_0 pypi 2025-05-07T20:28:37.9114478Z [conda] nvidia-cuda-cupti-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:37.9114993Z [conda] nvidia-cuda-nvrtc-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:37.9115522Z [conda] nvidia-cuda-runtime-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:37.9116032Z [conda] nvidia-cudnn-cu12 9.8.0.87 pypi_0 pypi 2025-05-07T20:28:37.9116621Z [conda] nvidia-cufft-cu12 11.3.3.41 pypi_0 pypi 2025-05-07T20:28:37.9117138Z [conda] nvidia-curand-cu12 10.3.9.55 pypi_0 pypi 2025-05-07T20:28:37.9117668Z [conda] nvidia-cusolver-cu12 11.7.2.55 pypi_0 pypi 2025-05-07T20:28:37.9118178Z [conda] nvidia-cusparse-cu12 12.5.7.53 pypi_0 pypi 2025-05-07T20:28:37.9118692Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:37.9119205Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:37.9119707Z [conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:37.9120209Z [conda] nvidia-nvtx-cu12 12.8.55 pypi_0 pypi 2025-05-07T20:28:37.9120704Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:37.9121182Z [conda] torch 2.8.0.dev20250507+cu128 pypi_0 pypi 2025-05-07T20:28:37.9121466Z 2025-05-07T20:28:37.9853091Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:37.9853838Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:37.9866846Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:37.9867205Z env: 2025-05-07T20:28:37.9867452Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:37.9867779Z BUILD_ENV: build_binary 2025-05-07T20:28:37.9868023Z BUILD_TARGET: genai 2025-05-07T20:28:37.9868249Z BUILD_VARIANT: cuda 2025-05-07T20:28:37.9868474Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:37.9868730Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:37.9869030Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:37.9869546Z ##[endgroup] 2025-05-07T20:28:38.3231857Z ################################################################################ 2025-05-07T20:28:38.3232261Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:38.3232513Z # 2025-05-07T20:28:38.3247324Z # [2025-05-07T20:28:38.324Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:38.3247744Z ################################################################################ 2025-05-07T20:28:38.3247971Z 2025-05-07T20:28:38.3264117Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:38.4170393Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:38.4192656Z [BUILD] Running git submodules update ... 2025-05-07T20:28:38.4215334Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:38.4577511Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:38.4577996Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:38.4578470Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:38.4578877Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:38.4579287Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:38.4579943Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:38.4580467Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:38.4612917Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:38.5164335Z [BUILD] Installing other build dependencies ... 
2025-05-07T20:28:38.5185916Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:40.9561670Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:41.0232994Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:41.1271935Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:41.1308120Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:41.3941526Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:41.3977797Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:41.5117635Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:41.5152134Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:41.8879519Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:41.8918458Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:41.9509238Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:41.9512887Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:42.0390518Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:42.0428613Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:42.0915830Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 21)) (2.0.2) 2025-05-07T20:28:42.1491502Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:42.1527263Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:42.2776684Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:42.2808796Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:42.3986327Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:42.4016698Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:42.4672888Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:42.5355336Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:42.5539681Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:42.6492260Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:42.6553306Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:42.7847214Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:42.7881448Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:42.9099906Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:42.9145284Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:43.0158507Z Collecting pyproject_hooks (from build->-r requirements.txt (line 
14)) 2025-05-07T20:28:43.0196964Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:43.1719505Z Collecting importlib-metadata>=4.6 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:43.1773351Z Downloading importlib_metadata-8.7.0-py3-none-any.whl.metadata (4.8 kB) 2025-05-07T20:28:43.3012700Z Collecting tomli>=1.1.0 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:43.3047335Z Downloading tomli-2.2.1-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:43.4257205Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:43.4314521Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:43.5521319Z Collecting exceptiongroup>=1.0.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:43.5559499Z Downloading exceptiongroup-1.2.2-py3-none-any.whl.metadata (6.6 kB) 2025-05-07T20:28:43.6597080Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:43.6639108Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:43.7243906Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:43.7740008Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:43.7776394Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:43.8288220Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:43.8821112Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:43.8854531Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:43.9317805Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:44.0152578Z Collecting zipp>=3.20 (from importlib-metadata>=4.6->build->-r requirements.txt (line 14)) 2025-05-07T20:28:44.0183241Z Downloading zipp-3.21.0-py3-none-any.whl.metadata (3.7 kB) 2025-05-07T20:28:44.1294220Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:44.1322472Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:44.1869058Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:44.2436919Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:44.3037650Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:44.8319453Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 52.9 MB/s eta 0:00:00 2025-05-07T20:28:44.8351876Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:44.8995601Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:44.9777938Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:45.0527258Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:45.1091427Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:45.1605958Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl 
(737 kB) 2025-05-07T20:28:45.2275772Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 737.4/737.4 kB 7.4 MB/s eta 0:00:00 2025-05-07T20:28:45.2468710Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:45.3100258Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:45.3680954Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:45.4200775Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:45.4841483Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:45.5469079Z Downloading exceptiongroup-1.2.2-py3-none-any.whl (16 kB) 2025-05-07T20:28:45.6061249Z Downloading importlib_metadata-8.7.0-py3-none-any.whl (27 kB) 2025-05-07T20:28:45.6670499Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:45.7274356Z Downloading tomli-2.2.1-py3-none-any.whl (14 kB) 2025-05-07T20:28:45.7858355Z Downloading zipp-3.21.0-py3-none-any.whl (9.6 kB) 2025-05-07T20:28:45.8475643Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:45.9062807Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:45.9670753Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:46.0252028Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:46.2745801Z Installing collected packages: sortedcontainers, zipp, tomli, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, exceptiongroup, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, importlib-metadata, hypothesis, pyre-extensions, build 2025-05-07T20:28:48.7203609Z 2025-05-07T20:28:48.7281830Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 exceptiongroup-1.2.2 hypothesis-6.131.14 importlib-metadata-8.7.0 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 tomli-2.2.1 typing-inspect-0.9.0 zipp-3.21.0 2025-05-07T20:28:48.9203938Z ################################################################################ 2025-05-07T20:28:48.9204305Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:48.9204584Z # 2025-05-07T20:28:48.9221980Z # [2025-05-07T20:28:48.921Z] + install_triton_pip build_binary 2025-05-07T20:28:48.9222399Z ################################################################################ 2025-05-07T20:28:48.9222627Z 2025-05-07T20:28:48.9222871Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:48.9223566Z ################################################################################ 2025-05-07T20:28:48.9223949Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:48.9224279Z # 2025-05-07T20:28:48.9238452Z # [2025-05-07T20:28:48.923Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:48.9239002Z ################################################################################ 2025-05-07T20:28:48.9239233Z 2025-05-07T20:28:48.9254300Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:49.0147743Z [CHECK] Network does not appear to be blocked. 
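Despite the "Install PyTorch (PyTorch PIP)" banner above, the traced call is install_triton_pip: this step pins pytorch-triton to nightly/3.2.0+git4b3bb1f8, replacing the 3.3.0+git96316ce5 build that the torch wheel pulled in, which is why pip reports a dependency conflict just below. Such conflicts can be re-surfaced at any time; a minimal sketch (assuming the build_binary env):

  # List any requirements broken by the deliberate triton pin
  conda run -n build_binary pip check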
2025-05-07T20:28:49.0148470Z ################################################################################ 2025-05-07T20:28:49.0149562Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:49.0150092Z # 2025-05-07T20:28:49.0164854Z # [2025-05-07T20:28:49.016Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:49.0165375Z ################################################################################ 2025-05-07T20:28:49.0165603Z 2025-05-07T20:28:49.0211910Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:49.0228425Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:49.0229041Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:49.0237817Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:49.0247659Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:49.0269336Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:56.7623838Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:56.7625191Z torch 2.8.0.dev20250507+cu128 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:56.7625894Z 2025-05-07T20:28:56.7626116Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:56.7626554Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:56.7627416Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:56.7628732Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.4 MB) 2025-05-07T20:28:56.7630020Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.4/166.4 MB 54.5 MB/s eta 0:00:00 2025-05-07T20:28:56.7630422Z Installing collected packages: pytorch-triton 2025-05-07T20:28:56.7630788Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:56.7631186Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:56.7631629Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:56.7632066Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:56.7632519Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:56.7632796Z 2025-05-07T20:28:58.9808428Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:58.9812707Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:01.1232901Z ################################################################################ 2025-05-07T20:29:01.1233369Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:01.1233794Z ################################################################################ 2025-05-07T20:29:01.1234017Z 2025-05-07T20:29:03.1634046Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:05.3472799Z [CHECK] Python (sub-)package 'skbuild' found ... 
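In the step above, __prepare_pip_arguments expands the spec nightly/3.2.0+git4b3bb1f8 into --pre pytorch-triton==3.2.0+git4b3bb1f8 plus the channel index https://download.pytorch.org/whl/nightly/. A minimal bash sketch of that mapping, with variable names assumed (the real helper lives in setup_env.bash and only its effects are visible in this log):

# Hypothetical sketch: "<channel>/<version>" -> pip arguments, for the
# non-RELEASE channels exercised in this log.
package="pytorch-triton"
spec="nightly/3.2.0+git4b3bb1f8"
channel="${spec%%/*}"                                 # -> nightly
version="${spec#*/}"                                  # -> 3.2.0+git4b3bb1f8
index_url="https://download.pytorch.org/whl/${channel}/"
pre_flag="--pre"                                      # non-RELEASE channels install pre-releases
conda run -n build_binary pip install ${pre_flag} "${package}==${version}" --index-url "${index_url}"

Note the resolver warning printed during this step: the nightly torch build (2.8.0.dev20250507+cu128) declares pytorch-triton==3.3.0+git96316ce5, so pinning 3.2.0+git4b3bb1f8 leaves a declared conflict even though the install itself succeeds.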
2025-05-07T20:29:05.3475759Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:05.3510641Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:05.3511146Z . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:05.3525429Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:05.3525780Z env: 2025-05-07T20:29:05.3526000Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:05.3526305Z BUILD_ENV: build_binary 2025-05-07T20:29:05.3526541Z BUILD_TARGET: genai 2025-05-07T20:29:05.3526763Z BUILD_VARIANT: cuda 2025-05-07T20:29:05.3526992Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:05.3527437Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:05.3527736Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:05.3528073Z ##[endgroup] 2025-05-07T20:29:05.6867860Z ################################################################################ 2025-05-07T20:29:05.6868260Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:05.6868531Z # 2025-05-07T20:29:05.6884876Z # [2025-05-07T20:29:05.688Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.6885571Z ################################################################################ 2025-05-07T20:29:05.6885802Z 2025-05-07T20:29:05.6886181Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.6886915Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.6887274Z 2025-05-07T20:29:05.7048762Z 94d0750d60163e549c1eb2cb2a791ec2cf9a4d41 fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.7051188Z 2025-05-07T20:29:05.7051618Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.7052006Z 2025-05-07T20:29:05.7236382Z 4ad1704987fa87cd63915598dc05a53ebebd35ab51336336eb8f0056001f042a fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.7239298Z 2025-05-07T20:29:05.7239841Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.7240202Z 2025-05-07T20:29:05.7576702Z 5c45ae153a493153a2b0776bec42bc74 fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.7579707Z 2025-05-07T20:29:05.7589821Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:05.7611565Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:08.5948730Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:08.5949747Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.0.2) 2025-05-07T20:29:08.5950774Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:08.5951232Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:08.5951513Z 2025-05-07T20:29:15.5271305Z ################################################################################ 2025-05-07T20:29:15.5271683Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:15.5272060Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128 2025-05-07T20:29:15.5272499Z [CHECK] CUDA version reported by PyTorch is: 12.8 2025-05-07T20:29:15.5272819Z [CHECK] 2025-05-07T20:29:15.5273145Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:29:15.5273682Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:29:15.5274090Z ################################################################################ 2025-05-07T20:29:15.5274316Z 2025-05-07T20:29:15.5274429Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:29:19.4582958Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:23.3710647Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:27.3055782Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:27.3059805Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:39.0598521Z ################################################################################ 2025-05-07T20:29:39.0599027Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:39.0599507Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:39.0599971Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:39.0601880Z ################################################################################ 2025-05-07T20:29:39.0602264Z 2025-05-07T20:29:46.8937660Z ################################################################################ 2025-05-07T20:29:46.8938478Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:46.8941476Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:46.8944598Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:46.8945153Z ################################################################################ 2025-05-07T20:29:46.8945390Z 2025-05-07T20:29:46.8945544Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:29:50.8113273Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:29:54.7288429Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:29:58.7681888Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:30:02.6917090Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:30:02.6921671Z [INSTALL] Check for operator registrations ... 
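The operator checks just below resolve each operator through the torch.ops namespace; importing fbgemm_gpu loads the registering shared libraries, and the attribute lookup raises if an op was never registered. A minimal sketch of an equivalent per-operator check (the actual logic is in setup_env.bash and may differ):

# Hypothetical equivalent: printing a resolved op yields its qualified
# name (e.g. "fbgemm.nccl_init", as seen in the output below); a missing
# registration raises instead.
for op in nccl_init gqa_attn_splitk rope_qkv_decoding; do
  conda run -n build_binary python -c \
    "import torch, fbgemm_gpu; print(getattr(torch.ops.fbgemm, '${op}'))"
done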
2025-05-07T20:30:06.5370687Z fbgemm.nccl_init 2025-05-07T20:30:06.5370870Z 2025-05-07T20:30:06.6010944Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:10.4539934Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:10.4540216Z 2025-05-07T20:30:10.5182361Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:14.3672268Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:14.3672508Z 2025-05-07T20:30:14.4308322Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:14.4308986Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:14.4353281Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:14.4353764Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:14.4367045Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:14.4367401Z env: 2025-05-07T20:30:14.4367625Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:14.4367923Z BUILD_ENV: build_binary 2025-05-07T20:30:14.4368165Z BUILD_TARGET: genai 2025-05-07T20:30:14.4368391Z BUILD_VARIANT: cuda 2025-05-07T20:30:14.4368622Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:30:14.4368872Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:14.4369177Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:14.4369516Z ##[endgroup] 2025-05-07T20:30:14.7733260Z ################################################################################ 2025-05-07T20:30:14.7733644Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:14.7733903Z # 2025-05-07T20:30:14.7749367Z # [2025-05-07T20:30:14.774Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:14.7749794Z ################################################################################ 2025-05-07T20:30:14.7750160Z 2025-05-07T20:30:22.6019246Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:22.6020058Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:22.6020599Z [TEST] Determined the test directories: 2025-05-07T20:30:22.6021027Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:22.6021445Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:22.6021849Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:22.6022117Z 2025-05-07T20:30:22.6030687Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:22.6037576Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:22.6038037Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:22.6038335Z 2025-05-07T20:30:23.0259358Z 2025-05-07T20:30:23.0259820Z [TEST] Installing PyTest ... 
2025-05-07T20:30:23.0282470Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:30:24.1315285Z Channels: 2025-05-07T20:30:24.1315532Z - conda-forge 2025-05-07T20:30:24.1315764Z Platform: linux-64 2025-05-07T20:30:27.4373070Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:28.5846450Z Solving environment: \ | / done 2025-05-07T20:30:28.8138510Z 2025-05-07T20:30:28.8139000Z ## Package Plan ## 2025-05-07T20:30:28.8139363Z 2025-05-07T20:30:28.8139790Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:28.8140444Z 2025-05-07T20:30:28.8140627Z added / updated specs: 2025-05-07T20:30:28.8141146Z - expecttest 2025-05-07T20:30:28.8141560Z - pytest 2025-05-07T20:30:28.8141803Z 2025-05-07T20:30:28.8141813Z 2025-05-07T20:30:28.8142042Z The following packages will be downloaded: 2025-05-07T20:30:28.8142516Z 2025-05-07T20:30:28.8142740Z package | build 2025-05-07T20:30:28.8143392Z ---------------------------|----------------- 2025-05-07T20:30:28.8144164Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:28.8144825Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:28.8145310Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:28.8145767Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:28.8146216Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:28.8146664Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:28.8147095Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:28.8147961Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:28.8148371Z ------------------------------------------------------------ 2025-05-07T20:30:28.8148724Z Total: 428 KB 2025-05-07T20:30:28.8148939Z 2025-05-07T20:30:28.8149073Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:28.8149297Z 2025-05-07T20:30:28.8149505Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:28.8150153Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:28.8150700Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:28.8151186Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:28.8151673Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:28.8152141Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:28.8152591Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:28.8153021Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:28.8153295Z 2025-05-07T20:30:28.8153299Z 2025-05-07T20:30:28.8153303Z 2025-05-07T20:30:28.8153452Z Downloading and Extracting Packages: ...working... 
[conda download progress redraws and terminal-control residue elided; colorama-0.4.6, exceptiongroup-1.2.2, expecttest-0.3.0, iniconfig-2.0.0, packaging-25.0, pluggy-1.5.0, pytest-8.3.5, and tomli-2.2.1 all reached 100%] done
2025-05-07T20:30:29.3870860Z Preparing transaction: done
2025-05-07T20:30:29.4875729Z Verifying transaction: done
2025-05-07T20:30:31.3902237Z Executing transaction: done
2025-05-07T20:30:31.5216265Z [TEST] Checking imports ...
2025-05-07T20:30:35.4487811Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:35.4499735Z [TEST] Setting feature flags ...
2025-05-07T20:30:35.4500166Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:35.4500530Z 2025-05-07T20:30:35.8694660Z 2025-05-07T20:30:35.8695058Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:35.8696669Z ################################################################################ 2025-05-07T20:30:35.8697307Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:35.8697558Z # 2025-05-07T20:30:35.8716641Z # [2025-05-07T20:30:35.871Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:35.8717077Z ################################################################################ 2025-05-07T20:30:35.8717299Z 2025-05-07T20:30:35.8724427Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:35.8753237Z ./attention/gqa_test.py 2025-05-07T20:30:35.8753528Z ./coalesce/coalesce_test.py 2025-05-07T20:30:35.8753804Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:35.8754089Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:35.8754404Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:35.8754667Z ./moe/activation_test.py 2025-05-07T20:30:35.8754922Z ./moe/gather_scatter_test.py 2025-05-07T20:30:35.8755184Z ./moe/layers_test.py 2025-05-07T20:30:35.8755425Z ./moe/shuffling_test.py 2025-05-07T20:30:35.8755670Z ./quantize/quantize_test.py 2025-05-07T20:30:35.8755862Z 2025-05-07T20:30:35.8755980Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:35.8756207Z 2025-05-07T20:30:35.8773450Z ################################################################################ 2025-05-07T20:30:35.8789000Z # [2025-05-07T20:30:35.878Z] Run Python Test Suite: 2025-05-07T20:30:35.8789328Z # ./attention/gqa_test.py 2025-05-07T20:30:35.8789605Z ################################################################################ 2025-05-07T20:30:35.8812828Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:35.8813483Z 2025-05-07T20:30:38.4237855Z ============================= test session starts ============================== 2025-05-07T20:30:38.4238537Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:38.4239096Z cachedir: .pytest_cache 2025-05-07T20:30:38.4239729Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:38.4240774Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:38.4241212Z plugins: hypothesis-6.131.14 2025-05-07T20:30:39.9589494Z collecting ... 
collected 2 items

2025-05-07T20:31:16.4684646Z attention/gqa_test.py::Int4GQATest::test_gqa
Hypothesis (profile 'ci', derandomized) tried 40 examples; the repetitive "Trying example: test_gqa(...)" records, whose object reprs were mangled by the log capture, are condensed here as (int4_kv, num_groups, B, MAX_T, N_H_L) tuples in the order tried:
(False,1,1,4,1) (True,1,1,4,1) (True,4,23,33,68) (True,4,77,4,1) (True,4,77,52,67) (False,4,57,45,120) (True,4,52,42,53) (True,1,77,95,53)
(True,4,113,48,96) (False,1,51,61,69) (False,4,17,113,65) (False,4,17,65,65) (False,4,65,65,65) (False,1,6,108,14) (False,1,6,14,14) (False,1,6,6,14)
(False,1,6,6,6) (False,1,70,94,78) (False,1,78,94,78) (False,1,94,94,78) (False,1,94,94,94) (False,4,41,105,126) (False,4,105,105,126) (False,4,105,105,105)
(True,1,95,114,43) (True,1,43,114,43) (True,1,43,43,43) (False,1,21,38,42) (False,1,38,38,42) (False,1,38,42,42) (False,1,42,42,42) (True,1,74,20,15)
(True,1,20,20,15) (True,1,20,15,15) (True,1,15,20,15) (True,1,15,15,15) (False,4,117,104,69) (False,4,117,117,69) (False,4,69,117,69) (False,4,117,69,69)
2025-05-07T20:31:16.4774170Z PASSED
2025-05-07T20:31:16.5120976Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
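The session header above reports hypothesis profile 'ci' with derandomize=True, which is why the same example sequence is replayed on every run. The project's actual conftest.py is not shown in this log; a sketch of registering such a profile, with the parameters copied from the header, would look like:

# Hypothetical registration of the derandomized 'ci' profile seen in the
# pytest session header above.
conda run -n build_binary python - <<'PY'
from hypothesis import HealthCheck, settings

settings.register_profile(
    "ci",
    database=None,                 # no example database
    deadline=None,                 # no per-example time limit
    print_blob=True,               # print reproduction blobs on failure
    derandomize=True,              # deterministic example generation
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")
print(settings.default)
PY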
2025-05-07T20:31:16.5121322Z 2025-05-07T20:31:16.5121965Z =========================== short test summary info ============================ 2025-05-07T20:31:16.5122767Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when CUDA is not available or xformers is not available 2025-05-07T20:31:16.5123522Z ======================== 1 passed, 1 skipped in 38.60s ========================= 2025-05-07T20:31:17.1647784Z 2025-05-07T20:31:17.1648453Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:31:17.1669061Z [TEST] Python test time for ./attention/gqa_test.py: 42 seconds 2025-05-07T20:31:17.1669479Z 2025-05-07T20:31:17.1669486Z 2025-05-07T20:31:17.1669491Z 2025-05-07T20:31:17.1669496Z 2025-05-07T20:31:17.1691638Z ################################################################################ 2025-05-07T20:31:17.1707307Z # [2025-05-07T20:31:17.170Z] Run Python Test Suite: 2025-05-07T20:31:17.1707809Z # ./coalesce/coalesce_test.py 2025-05-07T20:31:17.1708205Z ################################################################################ 2025-05-07T20:31:17.1733223Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:31:17.1734032Z 2025-05-07T20:31:19.3474705Z ============================= test session starts ============================== 2025-05-07T20:31:19.3475440Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:19.3476085Z cachedir: .pytest_cache 2025-05-07T20:31:19.3477354Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:19.3478895Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:19.3479740Z plugins: hypothesis-6.131.14 2025-05-07T20:31:20.9050470Z collecting ... 
collected 1 item 2025-05-07T20:31:20.9050909Z 2025-05-07T20:31:21.6415220Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:31:21.6415662Z 2025-05-07T20:31:21.6415842Z ============================== 1 passed in 2.43s =============================== 2025-05-07T20:31:22.2824802Z 2025-05-07T20:31:22.2825244Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:31:22.2846158Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:31:22.2846513Z 2025-05-07T20:31:22.2846518Z 2025-05-07T20:31:22.2846522Z 2025-05-07T20:31:22.2846526Z 2025-05-07T20:31:22.2866572Z ################################################################################ 2025-05-07T20:31:22.2881996Z # [2025-05-07T20:31:22.287Z] Run Python Test Suite: 2025-05-07T20:31:22.2882338Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:31:22.2882626Z ################################################################################ 2025-05-07T20:31:22.2907869Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:31:22.2908552Z 2025-05-07T20:31:24.4461893Z ============================= test session starts ============================== 2025-05-07T20:31:24.4462601Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:24.4463154Z cachedir: .pytest_cache 2025-05-07T20:31:24.4463780Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:24.4464560Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:24.4464989Z plugins: hypothesis-6.131.14 2025-05-07T20:31:26.0472050Z collecting ... 
collected 5 items 2025-05-07T20:31:26.0472295Z 2025-05-07T20:31:26.0483743Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:26.0493104Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:26.0501779Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:26.0510422Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:26.0528250Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:26.0528721Z 2025-05-07T20:31:26.0529229Z =========================== short test summary info ============================ 2025-05-07T20:31:26.0530256Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:26.0531667Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:26.0533094Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:26.0534514Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:26.0535932Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:26.0536905Z ============================== 5 skipped in 1.74s ============================== 2025-05-07T20:31:26.5917118Z 2025-05-07T20:31:26.5917603Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:26.5937774Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds 2025-05-07T20:31:26.5938082Z 2025-05-07T20:31:26.5938091Z 2025-05-07T20:31:26.5938095Z 2025-05-07T20:31:26.5938108Z 2025-05-07T20:31:26.5958301Z ################################################################################ 2025-05-07T20:31:26.5973602Z # [2025-05-07T20:31:26.597Z] Run Python Test Suite: 2025-05-07T20:31:26.5974027Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:26.5974461Z ################################################################################ 2025-05-07T20:31:26.6000238Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:26.6001181Z 2025-05-07T20:31:28.7542109Z ============================= test session starts ============================== 2025-05-07T20:31:28.7542825Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:28.7543389Z cachedir: .pytest_cache 2025-05-07T20:31:28.7544002Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:28.7544782Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:28.7545210Z plugins: hypothesis-6.131.14 2025-05-07T20:31:30.4353044Z collecting ... 
collected 2 items 2025-05-07T20:31:30.4353316Z 2025-05-07T20:31:30.4365132Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:30.4379953Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:30.4380439Z 2025-05-07T20:31:30.4380607Z =========================== short test summary info ============================ 2025-05-07T20:31:30.4381273Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:30.4382164Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:30.4383059Z ============================== 2 skipped in 1.82s ============================== 2025-05-07T20:31:30.9999584Z 2025-05-07T20:31:30.9999962Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:31.0021099Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds 2025-05-07T20:31:31.0021452Z 2025-05-07T20:31:31.0021456Z 2025-05-07T20:31:31.0021461Z 2025-05-07T20:31:31.0021485Z 2025-05-07T20:31:31.0043119Z ################################################################################ 2025-05-07T20:31:31.0058684Z # [2025-05-07T20:31:31.005Z] Run Python Test Suite: 2025-05-07T20:31:31.0060100Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:31.0060405Z ################################################################################ 2025-05-07T20:31:31.0083554Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:31.0084217Z 2025-05-07T20:31:33.1570462Z ============================= test session starts ============================== 2025-05-07T20:31:33.1571289Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:33.1571856Z cachedir: .pytest_cache 2025-05-07T20:31:33.1572470Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:33.1573278Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:33.1573712Z plugins: hypothesis-6.131.14 2025-05-07T20:31:34.7358208Z collecting ... collected 4 items 2025-05-07T20:31:34.7358545Z 2025-05-07T20:31:37.7740394Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
2025-05-07T20:31:37.7903226Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:37.8096719Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:37.8257335Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:37.8257892Z 2025-05-07T20:31:37.8258076Z =========================== short test summary info ============================ 2025-05-07T20:31:37.8258838Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:37.8260220Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when xformers is not available 2025-05-07T20:31:37.8260871Z ============================== 4 skipped in 4.80s ============================== 2025-05-07T20:31:39.5401834Z 2025-05-07T20:31:39.5402311Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:39.5423088Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds 2025-05-07T20:31:39.5423397Z 2025-05-07T20:31:39.5423402Z 2025-05-07T20:31:39.5423406Z 2025-05-07T20:31:39.5423409Z 2025-05-07T20:31:39.5442468Z ################################################################################ 2025-05-07T20:31:39.5457579Z # [2025-05-07T20:31:39.545Z] Run Python Test Suite: 2025-05-07T20:31:39.5457913Z # ./moe/activation_test.py 2025-05-07T20:31:39.5458198Z ################################################################################ 2025-05-07T20:31:39.5483886Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:39.5484539Z 2025-05-07T20:31:41.7055721Z ============================= test session starts ============================== 2025-05-07T20:31:41.7056410Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:41.7056962Z cachedir: .pytest_cache 2025-05-07T20:31:41.7057588Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:41.7058369Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:41.7058797Z plugins: hypothesis-6.131.14 2025-05-07T20:31:43.3624798Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:43.5767842Z collecting ... 
collected 2 items

2025-05-07T20:31:49.5708350Z moe/activation_test.py::ActivationTests::test_silu_mul
Hypothesis (profile 'ci', derandomized) tried 40 examples; the repetitive "Trying example: test_silu_mul(...)" records, whose object reprs were mangled by the log capture, are condensed here as (T, D, contiguous, compiled) tuples in the order tried:
(1,5120,True,True) (4096,5120,True,True) (4096,7168,False,False) (4096,5120,False,True) (1,7168,True,True) (1,7168,False,True) (4096,5120,False,False) (1,7168,True,False)
(2048,5120,True,True) (2048,7168,True,True) (2048,7168,True,False) (128,5120,False,True) (128,5120,True,True) (16384,5120,False,True) (16384,5120,False,False) (128,7168,True,False)
(128,7168,False,False) (1,5120,False,False) (1,7168,False,False) (4096,5120,True,False) (128,7168,True,True) (1,5120,False,True) (4096,7168,True,False) (4096,7168,False,True)
(128,5120,True,False) (128,5120,False,False) (1,5120,True,False) (2048,7168,False,True) (2048,7168,False,False) (16384,7168,False,True) (16384,7168,True,True) (4096,7168,True,True)
(2048,5120,False,False) (2048,5120,True,False) (128,7168,False,True) (16384,5120,True,True) (2048,5120,False,True) (16384,5120,True,False) (16384,7168,False,False) (16384,7168,True,False)
2025-05-07T20:31:49.5795265Z PASSED
2025-05-07T20:31:49.6407422Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:49.6408616Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last):
2025-05-07T20:31:49.6410142Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:49.6411752Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:49.6413276Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:49.6414817Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:49.6416649Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:49.6418178Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:49.6419741Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:49.6421117Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     generator.visit(fn.parse())
2025-05-07T20:31:49.6422455Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:49.6423808Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:49.6424946Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:49.6426067Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:49.6427406Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:49.6428825Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:49.6430448Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:49.6431596Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:49.6432891Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:49.6434388Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:49.6435599Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.6436602Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.6437403Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:49.6438514Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
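The warning block above comes from torch.compile's Triton integration: when identify_mutated_tensors cannot lower a user-defined Triton kernel to TTIR, it conservatively assumes every input is mutated and carries on. The underlying failure is the ValueError itself: Triton's fp8e4nv type (PyTorch's float8_e4m3fn) is only generated for NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); older parts only expose fp8e4b15 and fp8e5, so every fp8 kernel compilation in this job fails the same way, and this warning repeats for each compilation attempt. A minimal sketch of a capability guard that could skip these cases instead of erroring inside the compiler; the supports_fp8 helper is illustrative, not an FBGEMM or Triton API:

import unittest

import torch

def supports_fp8() -> bool:
    # Triton's fp8e4nv (float8_e4m3fn) codegen requires NVIDIA compute
    # capability >= 8.9 (Ada/Hopper); earlier GPUs raise the ValueError
    # seen in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8(), "fp8e4nv requires sm_89+ (Ada/Hopper)")
class Fp8KernelTests(unittest.TestCase):
    ...

With a guard like this, the hypothesis sweep below would report the fp8 tests as skipped on pre-sm_89 runners rather than failing during Triton compilation.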
2025-05-07T20:31:50.2147370Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:50.2148107Z self=, 2025-05-07T20:31:50.2148553Z T=1, 2025-05-07T20:31:50.2148756Z D=5120, 2025-05-07T20:31:50.2148961Z scale_ub=None, 2025-05-07T20:31:50.2149183Z contiguous=True, 2025-05-07T20:31:50.2149413Z compiled=True, 2025-05-07T20:31:50.2149636Z ) 2025-05-07T20:31:50.2150127Z self = 2025-05-07T20:31:50.2150686Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:50.2150976Z 2025-05-07T20:31:50.2151057Z @given( 2025-05-07T20:31:50.2151309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:50.2151640Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:50.2151961Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:50.2152311Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:50.2152652Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:50.2152950Z ) 2025-05-07T20:31:50.2153346Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:50.2153830Z def test_silu_mul_quant( 2025-05-07T20:31:50.2154078Z self, 2025-05-07T20:31:50.2154279Z T: int, 2025-05-07T20:31:50.2154482Z D: int, 2025-05-07T20:31:50.2154698Z scale_ub: Optional[float], 2025-05-07T20:31:50.2154984Z contiguous: bool, 2025-05-07T20:31:50.2155237Z compiled: bool, 2025-05-07T20:31:50.2155472Z ) -> None: 2025-05-07T20:31:50.2155681Z torch.manual_seed(2025) 2025-05-07T20:31:50.2155936Z 2025-05-07T20:31:50.2157348Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:50.2157717Z 2025-05-07T20:31:50.2157912Z x_sign = torch.sign(x) 2025-05-07T20:31:50.2158215Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:50.2158535Z x = x_sign * x_clamp 2025-05-07T20:31:50.2158786Z x0 = x[:, :D] 2025-05-07T20:31:50.2159011Z x1 = x[:, D:] 2025-05-07T20:31:50.2159219Z 2025-05-07T20:31:50.2159409Z if contiguous: 2025-05-07T20:31:50.2159651Z x0 = x0.contiguous()
2025-05-07T20:31:50.2159910Z x1 = x1.contiguous() 2025-05-07T20:31:50.2160159Z 2025-05-07T20:31:50.2160353Z if scale_ub is not None: 2025-05-07T20:31:50.2160629Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:50.2160986Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:50.2161309Z ) 2025-05-07T20:31:50.2161505Z else: 2025-05-07T20:31:50.2161711Z scale_ub_tensor = None 2025-05-07T20:31:50.2161977Z 2025-05-07T20:31:50.2162224Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:50.2162556Z op = silu_mul_quant 2025-05-07T20:31:50.2162807Z if compiled: 2025-05-07T20:31:50.2163062Z op = torch.compile(op) 2025-05-07T20:31:50.2163376Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:50.2163661Z 2025-05-07T20:31:50.2163863Z y_fp8, y_scale = fn() 2025-05-07T20:31:50.2164159Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:50.2164456Z 2025-05-07T20:31:50.2164695Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:50.2165047Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:50.2173687Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:50.2174240Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:50.2174617Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:50.2174947Z 2025-05-07T20:31:50.2175159Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:50.2175368Z 2025-05-07T20:31:50.2175477Z moe/activation_test.py:126: 2025-05-07T20:31:50.2175781Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:50.2176135Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:50.2176477Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:50.2177330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:50.2178157Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:50.2178742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:50.2179649Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:50.2180396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:50.2181177Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:50.2181996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:50.2183226Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:50.2184078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:50.2184771Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:50.2185413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:50.2185975Z fn() 2025-05-07T20:31:50.2186529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:50.2187323Z self.fn.run( 2025-05-07T20:31:50.2187829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:50.2188390Z kernel = self.compile( 2025-05-07T20:31:50.2188963Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:50.2189878Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:50.2190425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:50.2190755Z 2025-05-07T20:31:50.2191035Z self = 2025-05-07T20:31:50.2192456Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:50.2193998Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f214764c0>} 2025-05-07T20:31:50.2195487Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:50.2196613Z context = 2025-05-07T20:31:50.2196920Z 2025-05-07T20:31:50.2197098Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:50.2197645Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:50.2198141Z module_map=module_map) 2025-05-07T20:31:50.2198676Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.2199038Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:50.2199314Z E ^ 2025-05-07T20:31:50.2199814Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:50.2200304Z 2025-05-07T20:31:50.2200760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:50.2201320Z 2025-05-07T20:31:50.2201423Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:50.2201858Z self=, 2025-05-07T20:31:50.2202281Z T=2048, 2025-05-07T20:31:50.2202470Z D=5120, 2025-05-07T20:31:50.2202673Z scale_ub=1200.0, 2025-05-07T20:31:50.2202906Z contiguous=True, 2025-05-07T20:31:50.2203139Z compiled=False, 2025-05-07T20:31:50.2203351Z ) 2025-05-07T20:31:50.8081104Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:50.8082327Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:50.8084105Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:50.8085699Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:50.8087226Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:50.8089095Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:50.8090539Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:50.8092054Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:50.8093618Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:50.8094992Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:50.8096338Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:50.8097664Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:50.8098797Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:50.8099908Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:50.8101412Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:50.8102824Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:50.8104046Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:50.8105184Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:50.8106469Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:50.8107974Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:50.8109130Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.8110318Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:50.8111117Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:50.8112226Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
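Note that both sides of the comparison die identically: the kernel under test (_fbgemm_silu_mul_quant, via fn) and the reference path (triton_quantize_fp8_row, via ref_fn) each fail while compiling an fp8 kernel, so the assertion is never reached and the failure is environmental rather than a numerical bug. For orientation, the computation the test checks can be written in plain PyTorch. A rough sketch, assuming e4m3 fp8 with per-row scales and dequantization as y_fp8.to(torch.float32) * scale[:, None], which is how the test consumes the outputs; this mirrors the intent of triton_quantize_fp8_row but is not the FBGEMM implementation:

from typing import Optional, Tuple

import torch

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, then row-wise quantization to fp8 (e4m3).
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        # Cap the per-row maximum, as the scale_ub argument does in the
        # Triton kernels.
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    scale = row_max.clamp(min=1e-12) / fp8_max      # avoid divide-by-zero
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale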
2025-05-07T20:31:52.3742900Z self = 2025-05-07T20:31:52.3744289Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:52.3744797Z 2025-05-07T20:31:52.3744884Z @given( 2025-05-07T20:31:52.3745140Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:52.3745467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:52.3745793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:52.3746125Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:52.3746463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:52.3746760Z ) 2025-05-07T20:31:52.3747126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:52.3747609Z def test_silu_mul_quant( 2025-05-07T20:31:52.3747851Z self, 2025-05-07T20:31:52.3748047Z T: int, 2025-05-07T20:31:52.3748248Z D: int, 2025-05-07T20:31:52.3748460Z scale_ub: Optional[float], 2025-05-07T20:31:52.3748737Z contiguous: bool, 2025-05-07T20:31:52.3748977Z compiled: bool, 2025-05-07T20:31:52.3749209Z ) -> None: 2025-05-07T20:31:52.3749429Z torch.manual_seed(2025) 2025-05-07T20:31:52.3749779Z 2025-05-07T20:31:52.3750055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:52.3750418Z 2025-05-07T20:31:52.3750611Z x_sign = torch.sign(x) 2025-05-07T20:31:52.3751247Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:52.3751567Z x = x_sign * x_clamp 2025-05-07T20:31:52.3751814Z x0 = x[:, :D] 2025-05-07T20:31:52.3752036Z x1 = x[:, D:] 2025-05-07T20:31:52.3752243Z 2025-05-07T20:31:52.3752428Z if contiguous: 2025-05-07T20:31:52.3752658Z x0 = x0.contiguous() 2025-05-07T20:31:52.3752914Z x1 = x1.contiguous() 2025-05-07T20:31:52.3753154Z 2025-05-07T20:31:52.3753342Z if scale_ub is not None: 2025-05-07T20:31:52.3753613Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:52.3753957Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:52.3754274Z ) 2025-05-07T20:31:52.3754460Z else: 2025-05-07T20:31:52.3754674Z scale_ub_tensor = None
2025-05-07T20:31:52.3754930Z 2025-05-07T20:31:52.3755156Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:52.3755488Z op = silu_mul_quant 2025-05-07T20:31:52.3755742Z if compiled: 2025-05-07T20:31:52.3755987Z op = torch.compile(op) 2025-05-07T20:31:52.3756292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:52.3756580Z 2025-05-07T20:31:52.3756779Z > y_fp8, y_scale = fn() 2025-05-07T20:31:52.3756945Z 2025-05-07T20:31:52.3757047Z moe/activation_test.py:117: 2025-05-07T20:31:52.3757347Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:52.3757695Z moe/activation_test.py:115: in fn 2025-05-07T20:31:52.3757978Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:52.3758726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:52.3759827Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:52.3760402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:52.3761142Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:52.3761858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:52.3762430Z kernel = self.compile( 2025-05-07T20:31:52.3763000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:52.3763711Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:52.3764128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:52.3764386Z 2025-05-07T20:31:52.3764643Z self = 2025-05-07T20:31:52.3765828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:52.3767433Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f214d3ca0>} 2025-05-07T20:31:52.3768910Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:52.3770045Z context = 2025-05-07T20:31:52.3770422Z 2025-05-07T20:31:52.3770664Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:52.3771218Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:52.3771719Z module_map=module_map) 2025-05-07T20:31:52.3772096Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:52.3772565Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:52.3772829Z E ^ 2025-05-07T20:31:52.3773329Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:52.3773821Z 2025-05-07T20:31:52.3774280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:52.3774838Z 2025-05-07T20:31:52.3774950Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:52.3775376Z self=, 2025-05-07T20:31:52.3775797Z T=2048, 2025-05-07T20:31:52.3775987Z D=5120, 2025-05-07T20:31:52.3776179Z scale_ub=1200.0, 2025-05-07T20:31:52.3776402Z contiguous=True, 2025-05-07T20:31:52.3776631Z compiled=True, 2025-05-07T20:31:52.3776834Z ) 2025-05-07T20:31:52.3777161Z self = 2025-05-07T20:31:52.3777688Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:52.3777978Z 2025-05-07T20:31:52.3778057Z @given( 2025-05-07T20:31:52.3778294Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:52.3778622Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:52.3778945Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:52.3779290Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:52.3779637Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:52.3779942Z ) 2025-05-07T20:31:52.3780305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:52.3780774Z def test_silu_mul_quant( 2025-05-07T20:31:52.3781022Z self, 2025-05-07T20:31:52.3781218Z T: int, 2025-05-07T20:31:52.3781506Z D: int, 2025-05-07T20:31:52.3781734Z scale_ub: Optional[float], 2025-05-07T20:31:52.3782004Z contiguous: bool, 2025-05-07T20:31:52.3782254Z compiled: bool, 2025-05-07T20:31:52.3782495Z ) -> None: 2025-05-07T20:31:52.3782714Z torch.manual_seed(2025) 2025-05-07T20:31:52.3783323Z 2025-05-07T20:31:52.3783612Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:52.3783975Z 2025-05-07T20:31:52.3784169Z x_sign = torch.sign(x) 2025-05-07T20:31:52.3784470Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:52.3784825Z x = x_sign * x_clamp 2025-05-07T20:31:52.3785086Z x0 = x[:, :D] 2025-05-07T20:31:52.3785307Z x1 = x[:, D:] 2025-05-07T20:31:52.3785520Z 2025-05-07T20:31:52.3785703Z if contiguous: 2025-05-07T20:31:52.3785948Z x0 = x0.contiguous() 2025-05-07T20:31:52.3786221Z x1 = x1.contiguous() 2025-05-07T20:31:52.3786474Z 2025-05-07T20:31:52.3786675Z if scale_ub is not None: 2025-05-07T20:31:52.3786966Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:52.3787318Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:52.3787643Z ) 2025-05-07T20:31:52.3787841Z else: 2025-05-07T20:31:52.3788055Z scale_ub_tensor = None 2025-05-07T20:31:52.3788320Z 2025-05-07T20:31:52.3788563Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:52.3788891Z op = silu_mul_quant 2025-05-07T20:31:52.3789159Z if compiled: 2025-05-07T20:31:52.3789421Z op = torch.compile(op) 2025-05-07T20:31:52.3789837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:52.3790118Z 2025-05-07T20:31:52.3790309Z y_fp8, y_scale = fn() 2025-05-07T20:31:52.3790603Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:52.3790896Z 2025-05-07T20:31:52.3791134Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:52.3791484Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:52.3791782Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:52.3792298Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:52.3792683Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:52.3793010Z 2025-05-07T20:31:52.3793216Z > 
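Every failure in this job has the same root cause: Triton's fp8e4nv type (torch.float8_e4m3fn) is only accepted by the NVIDIA backend on compute capability 8.9 or newer, and the A10G on this g5 runner is SM 8.6, which is why the backend offers only 'fp8e4b15' and 'fp8e5'. A minimal sketch of a capability guard that would skip these tests on older parts; the helper name and the unittest wiring are illustrative assumptions, not FBGEMM's actual test scaffolding:

import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # Assumption: Triton maps fp8e4nv to torch.float8_e4m3fn, which its
    # NVIDIA backend only accepts on SM 8.9+ (Ada/Hopper). The A10G on
    # this runner reports (8, 6), so this returns False there.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
class SiluMulQuantTests(unittest.TestCase):  # hypothetical test class
    ...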
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

[test source as in the listing above, continuing:]

    y_fp8, y_scale = fn()
    y = y_fp8.to(torch.float32) * y_scale[:, None]

    def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
        return triton_quantize_fp8_row(y, scale_ub_tensor)

>   y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f21ae3af0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
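Each "Trying example" line records the concrete Hypothesis draw, so a failing combination can be replayed deterministically with @example, which always runs before any random generation. A sketch using the strategies from this test; the test body here is a placeholder, not the real assertion:

from hypothesis import example, given, settings, strategies as st


@settings(max_examples=10, deadline=None)
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
def test_replayed_example(T, D, scale_ub, contiguous, compiled) -> None:
    # The @example draw above is one of the failing combinations from
    # this log; it is replayed first on every invocation.
    assert T in (1, 128, 2048, 4096, 16384)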
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last):
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     generator.visit(fn.parse())
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ret = super().visit(node)
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     return visitor(node)
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ast.NodeVisitor.generic_visit(self, node)
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     self.visit(item)
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0:
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant(
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[the identical warning and traceback were logged three more times, at 20:31:52.970502, 20:31:53.466465, and 20:31:53.505916]
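The repeated identify_mutated_tensors warnings above are a side effect of the same failure: when torch.compile cannot lower the user-defined Triton kernel to TTIR, it logs the exception and conservatively assumes every input is mutated. The dtype names in the message map to torch types: fp8e4nv is torch.float8_e4m3fn (SM 8.9+ only), while fp8e5 is torch.float8_e5m2, which the message says this GPU still accepts. A hedged fallback sketch along those lines; pick_fp8_dtype is illustrative, not an fbgemm_gpu API:

import torch


def pick_fp8_dtype() -> torch.dtype:
    # Assumption: fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+, while
    # fp8e5 (torch.float8_e5m2) is accepted on this SM 8.6 A10G, per the
    # error message in the log above.
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2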
self =
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

[test source identical to the first listing above]

>   y_fp8, y_scale = fn()

moe/activation_test.py:117:
[traceback frames identical to the first fn() failure above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)

self =
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

[test source identical to the second listing above]

>   y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[traceback frames identical to the first ref_fn() failure above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
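The ref_fn path fails one level deeper, inside triton_quantize_fp8_row, while autotuning _kernel_quantize_fp8_row. As a rough mental model of what that row-wise quantization computes, an assumption inferred from how the test dequantizes (y_fp8.to(float32) * y_scale[:, None]) rather than the kernel's actual code:

from typing import Optional, Tuple

import torch


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor]
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Sketch only: scale each row so its max magnitude maps to the
    # float8_e4m3fn max (448.0), with scale_ub, if given, capping the
    # row max. Not the fbgemm_gpu kernel's actual implementation.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    row_max = y.abs().amax(dim=1).to(torch.float32).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    y_scale = row_max / fp8_max                # per-row dequant scale
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale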
Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[the same identify_mutated_tensors warning and Triton traceback as above were logged four times with retry counter [1/3], at 20:31:55.615908, 20:31:56.233270, 20:31:56.993694, and 20:31:57.033100]
2025-05-07T20:32:00.5808069Z self = 
2025-05-07T20:32:00.5809033Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:00.5809501Z 
2025-05-07T20:32:00.5809626Z @given(
2025-05-07T20:32:00.5809995Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:00.5810510Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:00.5811012Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:00.5811559Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:00.5812107Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:00.5812581Z )
2025-05-07T20:32:00.5813102Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:00.5813754Z def test_silu_mul_quant(
2025-05-07T20:32:00.5814113Z     self,
2025-05-07T20:32:00.5814391Z     T: int,
2025-05-07T20:32:00.5814666Z     D: int,
2025-05-07T20:32:00.5814973Z     scale_ub: Optional[float],
2025-05-07T20:32:00.5815351Z     contiguous: bool,
2025-05-07T20:32:00.5816132Z     compiled: bool,
2025-05-07T20:32:00.5816517Z ) -> None:
2025-05-07T20:32:00.5816815Z     torch.manual_seed(2025)
2025-05-07T20:32:00.5817159Z 
2025-05-07T20:32:00.5817547Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:00.5818039Z 
2025-05-07T20:32:00.5818311Z     x_sign = torch.sign(x)
2025-05-07T20:32:00.5818729Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:00.5819176Z     x = x_sign * x_clamp
2025-05-07T20:32:00.5819522Z     x0 = x[:, :D]
2025-05-07T20:32:00.5819824Z     x1 = x[:, D:]
2025-05-07T20:32:00.5820119Z 
2025-05-07T20:32:00.5820380Z     if contiguous:
2025-05-07T20:32:00.5820698Z         x0 = x0.contiguous()
2025-05-07T20:32:00.5821071Z         x1 = x1.contiguous()
2025-05-07T20:32:00.5821417Z 
2025-05-07T20:32:00.5821683Z     if scale_ub is not None:
2025-05-07T20:32:00.5822072Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:00.5822640Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:00.5823171Z         )
2025-05-07T20:32:00.5823492Z     else:
2025-05-07T20:32:00.5823840Z         scale_ub_tensor = None
2025-05-07T20:32:00.5824261Z 
2025-05-07T20:32:00.5824650Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.5825199Z         op = silu_mul_quant
2025-05-07T20:32:00.5825612Z         if compiled:
2025-05-07T20:32:00.5826030Z             op = torch.compile(op)
2025-05-07T20:32:00.5826588Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:00.5827059Z 
2025-05-07T20:32:00.5827378Z >   y_fp8, y_scale = fn()
2025-05-07T20:32:00.5827673Z 
2025-05-07T20:32:00.5827837Z moe/activation_test.py:117: 
2025-05-07T20:32:00.5828348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:00.5829161Z moe/activation_test.py:115: in fn
2025-05-07T20:32:00.5829649Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:00.5831255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:00.5832730Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:00.5833857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:00.5835227Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:00.5836316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:00.5837114Z     kernel = self.compile(
2025-05-07T20:32:00.5837948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:00.5839009Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:00.5839678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:00.5840069Z 
2025-05-07T20:32:00.5840401Z self = 
2025-05-07T20:32:00.5842234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:00.5844635Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f21ae0700>}
2025-05-07T20:32:00.5846993Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:00.5848786Z context = 
2025-05-07T20:32:00.5849281Z 
2025-05-07T20:32:00.5849722Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:00.5850634Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:00.5851447Z                            module_map=module_map)
2025-05-07T20:32:00.5852047Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:00.5852641Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:00.5853074Z E       ^
2025-05-07T20:32:00.5853861Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
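Note: every failure in this job has the same root cause, visible in the traceback above: Triton refuses to emit the fp8e4nv dtype (torch.float8_e4m3fn) while compiling _fbgemm_silu_mul_quant. The runner is a linux.g5.4xlarge, whose NVIDIA A10G reports compute capability sm_86, and Triton only enables fp8e4nv on sm_89+ (Ada/Hopper); pre-sm_89 parts are limited to fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal sketch of a capability gate that would skip these cases on unsupported GPUs — the helper and the skipif usage are illustrative, not part of moe/activation_test.py:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv maps to torch.float8_e4m3fn and is only compiled
        # for NVIDIA GPUs with compute capability >= (8, 9) (Ada / Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # @pytest.mark.skipif(not supports_fp8e4nv(), reason="fp8e4nv requires sm_89+")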
2025-05-07T20:32:00.5854666Z 
2025-05-07T20:32:00.5855393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:00.5856294Z 
2025-05-07T20:32:00.5856469Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.5857158Z     self=,
2025-05-07T20:32:00.5857824Z     T=4096,
2025-05-07T20:32:00.5858136Z     D=7168,
2025-05-07T20:32:00.5858440Z     scale_ub=None,
2025-05-07T20:32:00.5858778Z     contiguous=False,
2025-05-07T20:32:00.5859146Z     compiled=False,
2025-05-07T20:32:00.5859480Z )
[... test source and _fbgemm_silu_mul_quant CompilationError identical to the previous example; omitted ...]
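Note: the full test body is echoed once per drawn example because the test runs with @settings(verbosity=Verbosity.verbose, ...); Hypothesis then prints "Trying example:" with all arguments for every draw, so a single environment bug repeats until max_examples is exhausted. A sketch of the same strategy style at default verbosity, which reports only the final shrunk counterexample (the test below is a stand-in, not the FBGEMM test):

    from hypothesis import Verbosity, given, settings, strategies as st

    @settings(verbosity=Verbosity.normal, max_examples=16, deadline=None)
    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    def test_shapes_quietly(T: int) -> None:
        assert T >= 1  # placeholder assertion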
2025-05-07T20:32:00.5908230Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.5908913Z     self=,
2025-05-07T20:32:00.5909551Z     T=128,
2025-05-07T20:32:00.5909922Z     D=7168,
2025-05-07T20:32:00.5910221Z     scale_ub=None,
2025-05-07T20:32:00.5910558Z     contiguous=False,
2025-05-07T20:32:00.5910902Z     compiled=True,
2025-05-07T20:32:00.5911218Z )
2025-05-07T20:32:00.6710034Z self = 
2025-05-07T20:32:00.6710941Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
[... @given decorators and test body identical to above, through scale_ub_tensor = None; omitted ...]
2025-05-07T20:32:00.6726853Z 
2025-05-07T20:32:00.6727227Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.6727720Z         op = silu_mul_quant
2025-05-07T20:32:00.6728092Z         if compiled:
2025-05-07T20:32:00.6728468Z             op = torch.compile(op)
2025-05-07T20:32:00.6728921Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:00.6729351Z 
2025-05-07T20:32:00.6729641Z     y_fp8, y_scale = fn()
2025-05-07T20:32:00.6730100Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:00.6730566Z 
2025-05-07T20:32:00.6730938Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.6731484Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:00.6731976Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:00.6732497Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:00.6733104Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:00.6733645Z 
2025-05-07T20:32:00.6733977Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:00.6734302Z 
2025-05-07T20:32:00.6734476Z moe/activation_test.py:126: 
2025-05-07T20:32:00.6734942Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:00.6735490Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:00.6735995Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:00.6737267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:00.6738601Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:00.6739571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:00.6740781Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:00.6742023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:00.6743466Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:00.6744828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:00.6746184Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:00.6747495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:00.6748505Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:00.6749401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:00.6750386Z     fn()
2025-05-07T20:32:00.6751232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:00.6752264Z     self.fn.run(
2025-05-07T20:32:00.6753026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:00.6753955Z     kernel = self.compile(
2025-05-07T20:32:00.6754885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:00.6755999Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:00.6756669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:00.6757075Z 
2025-05-07T20:32:00.6757413Z self = 
2025-05-07T20:32:00.6759309Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:00.6761970Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f1d2114c0>}
2025-05-07T20:32:00.6764364Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:00.6766163Z context = 
2025-05-07T20:32:00.6766671Z 
2025-05-07T20:32:00.6766948Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:00.6767847Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:00.6768653Z                            module_map=module_map)
2025-05-07T20:32:00.6769253Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:00.6769841Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:00.6770289Z E       ^
2025-05-07T20:32:00.6771078Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:00.6771898Z 
2025-05-07T20:32:00.6772636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
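Note: the reference path fails identically because triton_quantize_fp8_row also JIT-compiles a Triton kernel (_kernel_quantize_fp8_row) that writes an fp8e4nv output, so on this GPU even the "reference" half of the comparison cannot run. A sketch of an eager, Triton-free row-wise fp8 quantization that could stand in on such hardware — the scale convention (dequantize via y_fp8.to(torch.float32) * scale[:, None]) follows the test above, but matching triton_quantize_fp8_row's exact semantics is an assumption:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_eager(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max-abs scaling into the float8_e4m3fn range (max normal = 448).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / fp8_max  # per-row dequantization multiplier
        y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)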
2025-05-07T20:32:00.6773706Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.6774411Z     self=,
2025-05-07T20:32:00.6775086Z     T=128,
2025-05-07T20:32:00.6775382Z     D=7168,
2025-05-07T20:32:00.6775680Z     scale_ub=None,
2025-05-07T20:32:00.6776017Z     contiguous=False,
2025-05-07T20:32:00.6776399Z     compiled=False,
2025-05-07T20:32:00.6776747Z )
[... test source and _fbgemm_silu_mul_quant CompilationError identical to the first example; omitted ...]
2025-05-07T20:32:00.9322405Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.9322836Z     self=,
2025-05-07T20:32:00.9323262Z     T=4096,
2025-05-07T20:32:00.9323443Z     D=5120,
2025-05-07T20:32:00.9323730Z     scale_ub=1200.0,
2025-05-07T20:32:00.9323955Z     contiguous=True,
2025-05-07T20:32:00.9324175Z     compiled=False,
2025-05-07T20:32:00.9324387Z )
[... test source and _fbgemm_silu_mul_quant CompilationError identical to the first example; omitted ...]
2025-05-07T20:32:00.9364099Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.9364613Z     self=,
2025-05-07T20:32:00.9365040Z     T=1,
2025-05-07T20:32:00.9365215Z     D=5120,
2025-05-07T20:32:00.9365407Z     scale_ub=None,
2025-05-07T20:32:00.9365623Z     contiguous=True,
2025-05-07T20:32:00.9365847Z     compiled=True,
2025-05-07T20:32:00.9366051Z )
[... identify_mutated_tensors warning traceback ([1/4]) repeated four times (timestamps 20:32:01.466499, 20:32:01.654495, 20:32:02.161635, 20:32:02.201485), each ending in the same fp8e4nv ValueError; omitted ...]
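Note: the [1/3], [1/4], [1/5] prefixes on these warnings are torch.compile restart counters. On each attempt, Dynamo lowers the user-defined Triton kernel to TTIR (generate_ttir) to determine which tensor arguments the kernel mutates; when that lowering raises — here, the fp8e4nv ValueError — identify_mutated_tensors falls back to conservatively treating every input as mutated, logs this warning, and tracing continues. A minimal kernel that reproduces the underlying Triton error on a pre-sm_89 GPU (illustrative code, not from FBGEMM):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def fp8_store_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The fp8e4nv-typed store (FBGEMM's kernels likewise write fp8 outputs)
        # is what trips "type fp8e4nv not supported in this architecture".
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # Raises triton.compiler.errors.CompilationError on sm_86 (e.g. A10G):
    fp8_store_kernel[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)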
2025-05-07T20:32:02.5528823Z self = 
2025-05-07T20:32:02.5529441Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source identical to above; fails the same way as the T=128, D=7168, compiled=True example: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row CompilationError with the same fp8e4nv ValueError ...]
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.5561953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ee021f700>} 2025-05-07T20:32:02.5563419Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.5564527Z context = 2025-05-07T20:32:02.5564837Z 2025-05-07T20:32:02.5565008Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.5565558Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.5566051Z module_map=module_map) 2025-05-07T20:32:02.5566427Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.5566785Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:02.5567060Z E ^ 2025-05-07T20:32:02.5567550Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.5568037Z 2025-05-07T20:32:02.5568494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.5569048Z 2025-05-07T20:32:02.5569148Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.5569579Z self=, 2025-05-07T20:32:02.5569999Z T=2048, 2025-05-07T20:32:02.5570182Z D=5120, 2025-05-07T20:32:02.5570380Z scale_ub=None, 2025-05-07T20:32:02.5570676Z contiguous=True, 2025-05-07T20:32:02.5570895Z compiled=True, 2025-05-07T20:32:02.5571101Z ) 2025-05-07T20:32:03.0501592Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:03.0502770Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:03.0504255Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:03.0505874Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:03.0507410Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:03.0508942Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.0510532Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:03.0512057Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.0513945Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:03.0515324Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:03.0516670Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:03.0518053Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:03.0519192Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:03.0520326Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:03.0521662Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:03.0523083Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:03.0524304Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:03.0525451Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:03.0526894Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:03.0528391Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:03.0529553Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.0530542Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.0531350Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:03.0532465Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.2376732Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:03.2378290Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:03.2379779Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:03.2381366Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:03.2383453Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:03.2384984Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.2386428Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:03.2387948Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.2389522Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:03.2390996Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:03.2392338Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:03.2393663Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:03.2394797Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:03.2396063Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:03.2397405Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:03.2398810Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:03.2400031Z W0507 
20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:03.2401174Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:03.2402476Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:03.2403972Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:03.2405129Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.2406122Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.2406930Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:03.2408161Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.7439221Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:03.7440407Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:03.7441898Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:03.7443497Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:03.7445041Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:03.7446572Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.7448022Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:03.7449553Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.7451484Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:03.7452873Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:03.7454219Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:03.7455551Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:03.7456690Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:03.7457816Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:03.7459214Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:03.7460618Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:03.7461838Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:03.7462982Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:03.7464432Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:03.7465929Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:03.7467088Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.7468078Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.7468879Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:03.7470164Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.7831384Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:03.7832524Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:03.7833985Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:03.7835542Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:03.7837208Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:03.7838726Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.7840150Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:03.7841646Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.7843201Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:03.7844578Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:03.7845937Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:03.7847260Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:03.7848385Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:03.7849627Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:03.7850970Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:03.7852377Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:03.7853587Z W0507 
20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:03.7854727Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:03.7856019Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:03.7857520Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:03.7858680Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.7859664Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.7860466Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:03.7861586Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.2758739Z self = 2025-05-07T20:32:04.2759809Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:04.2760209Z 2025-05-07T20:32:04.2760320Z @given( 2025-05-07T20:32:04.2760625Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.2761030Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.2761383Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.2761727Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.2762066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.2762558Z ) 2025-05-07T20:32:04.2762935Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.2763402Z def test_silu_mul_quant( 2025-05-07T20:32:04.2763652Z self, 2025-05-07T20:32:04.2763857Z T: int, 2025-05-07T20:32:04.2764051Z D: int, 2025-05-07T20:32:04.2764272Z scale_ub: Optional[float], 2025-05-07T20:32:04.2764554Z contiguous: bool, 2025-05-07T20:32:04.2764796Z compiled: bool, 2025-05-07T20:32:04.2765038Z ) -> None: 2025-05-07T20:32:04.2765258Z torch.manual_seed(2025) 2025-05-07T20:32:04.2765501Z 2025-05-07T20:32:04.2765781Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.2766143Z 2025-05-07T20:32:04.2766344Z x_sign = torch.sign(x) 2025-05-07T20:32:04.2766647Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.2766964Z x = x_sign * x_clamp 2025-05-07T20:32:04.2767210Z x0 = x[:, :D] 2025-05-07T20:32:04.2767435Z x1 = x[:, D:] 2025-05-07T20:32:04.2767644Z 2025-05-07T20:32:04.2767835Z if contiguous: 2025-05-07T20:32:04.2768071Z x0 = x0.contiguous() 2025-05-07T20:32:04.2768327Z x1 = x1.contiguous() 2025-05-07T20:32:04.2768748Z 2025-05-07T20:32:04.2768938Z if scale_ub is not None: 2025-05-07T20:32:04.2769210Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.2769560Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.2769880Z ) 2025-05-07T20:32:04.2770068Z else: 2025-05-07T20:32:04.2770279Z scale_ub_tensor = None 

    def fn() -> Tuple[torch.Tensor, torch.Tensor]:
        op = silu_mul_quant
        if compiled:
            op = torch.compile(op)
        return op(x0, x1, scale_ub_tensor)

    y_fp8, y_scale = fn()
    y = y_fp8.to(torch.float32) * y_scale[:, None]

    def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
        return triton_quantize_fp8_row(y, scale_ub_tensor)

>   y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[... traceback identical to the first failure above (triton_quantize_fp8_row -> _kernel_quantize_fp8_row -> Triton autotuner -> compile) ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
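The computation under test is small: SiLU(x0) * x1 in float32, then row-wise FP8 quantization with an optional per-row scale upper bound. A plain-PyTorch sketch of that reference path follows; the row-wise contract is inferred from the test's dequant step (y_fp8.to(torch.float32) * y_scale[:, None]), not taken from triton_quantize_fp8_row itself, the FP8 max constant is an assumption, and torch.float8_e4m3fn requires a recent PyTorch:

    # Sketch of the reference path under the assumptions stated above.
    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn (assumed)

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        x0_fp32 = x0.to(torch.float32)
        y = x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)
        amax = y.abs().amax(dim=1).clamp(min=1e-12)  # per-row absolute max
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)     # honor the upper bound
        y_scale = amax / FP8_E4M3_MAX                # per-row dequant scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale  # y ~ y_fp8.to(torch.float32) * y_scale[:, None]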
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.8096174Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:04.8097518Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.8098850Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:04.8099987Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:04.8101111Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:04.8102455Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.8103872Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.8105096Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:04.8106237Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:04.8107539Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.8109148Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.8110475Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.8111467Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.8112275Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:04.8113386Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.0006322Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.0007511Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:05.0009003Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.0010578Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.0012118Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.0013942Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.0015393Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.0016911Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.0018475Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.0019861Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:05.0021211Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.0022541Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:05.0023677Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:05.0024786Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:05.0026268Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.0027677Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.0028897Z W0507 
20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:05.0030959Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:05.0032245Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.0033750Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.0034905Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.0035890Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.0036684Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:05.0037799Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5111089Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.5113386Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:05.5116320Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.5118805Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.5120338Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.5122201Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5123641Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.5125159Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5126716Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.5128378Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:05.5129721Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.5131049Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:05.5132188Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:05.5133314Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:05.5134659Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.5136066Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.5137277Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:05.5138416Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:05.5139708Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.5141357Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.5142517Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5143501Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5144299Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:05.5145411Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5512796Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.5514132Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:05.5515594Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.5517173Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.5518788Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.5520533Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5521978Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.5523493Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5525063Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.5526439Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:05.5527838Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.5529170Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:05.5530577Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:05.5531700Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:05.5533207Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.5534625Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.5535847Z W0507 
20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:05.5536984Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:05.5538284Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.5539790Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.5540953Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5541945Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5542738Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:05.5543855Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.9900297Z self = 2025-05-07T20:32:05.9900991Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.9901281Z 2025-05-07T20:32:05.9901359Z @given( 2025-05-07T20:32:05.9901861Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.9902193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.9902506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.9902853Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.9903193Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.9903488Z ) 2025-05-07T20:32:05.9903848Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.9904317Z def test_silu_mul_quant( 2025-05-07T20:32:05.9904564Z self, 2025-05-07T20:32:05.9904755Z T: int, 2025-05-07T20:32:05.9904961Z D: int, 2025-05-07T20:32:05.9905184Z scale_ub: Optional[float], 2025-05-07T20:32:05.9905458Z contiguous: bool, 2025-05-07T20:32:05.9905705Z compiled: bool, 2025-05-07T20:32:05.9905939Z ) -> None: 2025-05-07T20:32:05.9906153Z torch.manual_seed(2025) 2025-05-07T20:32:05.9906401Z 2025-05-07T20:32:05.9906692Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.9907049Z 2025-05-07T20:32:05.9907246Z x_sign = torch.sign(x) 2025-05-07T20:32:05.9907555Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.9907915Z x = x_sign * x_clamp 2025-05-07T20:32:05.9908165Z x0 = x[:, :D] 2025-05-07T20:32:05.9908387Z x1 = x[:, D:] 2025-05-07T20:32:05.9908601Z 2025-05-07T20:32:05.9908782Z if contiguous: 2025-05-07T20:32:05.9909020Z x0 = x0.contiguous() 2025-05-07T20:32:05.9909286Z x1 = x1.contiguous() 2025-05-07T20:32:05.9909531Z 2025-05-07T20:32:05.9909901Z if scale_ub is not None: 2025-05-07T20:32:05.9910189Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.9910723Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.9911042Z ) 2025-05-07T20:32:05.9911236Z else: 2025-05-07T20:32:05.9911455Z scale_ub_tensor = None 
2025-05-07T20:32:05.9911715Z 2025-05-07T20:32:05.9911954Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.9912276Z op = silu_mul_quant 2025-05-07T20:32:05.9912532Z if compiled: 2025-05-07T20:32:05.9912789Z op = torch.compile(op) 2025-05-07T20:32:05.9913087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.9913376Z 2025-05-07T20:32:05.9913569Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.9913859Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.9914152Z 2025-05-07T20:32:05.9914391Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.9914739Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.9915043Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.9915370Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.9915751Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.9916176Z 2025-05-07T20:32:05.9916415Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.9916666Z 2025-05-07T20:32:05.9916775Z moe/activation_test.py:126: 2025-05-07T20:32:05.9917081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.9917441Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.9917784Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.9918639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.9919455Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.9920038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.9920779Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.9921678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.9922450Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.9923264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.9924069Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.9924850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.9925528Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.9926169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.9926725Z fn() 2025-05-07T20:32:05.9927260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.9927892Z self.fn.run( 2025-05-07T20:32:05.9928391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.9928959Z kernel = self.compile( 2025-05-07T20:32:05.9929533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.9930236Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.9930654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.9930903Z 2025-05-07T20:32:05.9931118Z self = 2025-05-07T20:32:05.9932299Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.9933918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7db5ea60>} 2025-05-07T20:32:05.9935387Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.9936498Z context = 2025-05-07T20:32:05.9936802Z 2025-05-07T20:32:05.9936971Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.9937521Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.9938031Z module_map=module_map) 2025-05-07T20:32:05.9938447Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.9938816Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.9939089Z E ^ 2025-05-07T20:32:05.9939581Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.9940067Z 2025-05-07T20:32:05.9940512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.9941073Z 2025-05-07T20:32:05.9941174Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.9941597Z self=, 2025-05-07T20:32:05.9942016Z T=4096, 2025-05-07T20:32:05.9942197Z D=5120, 2025-05-07T20:32:05.9942387Z scale_ub=None, 2025-05-07T20:32:05.9942599Z contiguous=True, 2025-05-07T20:32:05.9942814Z compiled=True, 2025-05-07T20:32:05.9943022Z ) 2025-05-07T20:32:06.5234940Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:06.5236132Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:32:06.5237616Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:06.5239254Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:06.5240783Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:06.5242319Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.5243760Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:06.5245277Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.5246839Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] 
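The "Trying example:" lines come from @settings(verbosity=Verbosity.verbose): Hypothesis prints each generated argument tuple before running it, and here every tuple hits the same compile error. To replay one known-failing case deterministically instead of re-sampling, Hypothesis's @example decorator can pin inputs; a sketch with the test body elided and the test name hypothetical:

    # Sketch: pin one failing tuple so it reruns on every invocation.
    from hypothesis import example, given, settings
    from hypothesis import strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=2048, D=5120)  # first failing tuple shown in this log
    @settings(max_examples=5, deadline=None)
    def test_repro(T: int, D: int) -> None:
        ...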
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:06.5248368Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:32:06.5249702Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.5251033Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:32:06.5252165Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:06.5253280Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:32:06.5254625Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.5256028Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.5257246Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:06.5258385Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:32:06.5259679Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.5261260Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.5262414Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.5263404Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.5264208Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:32:06.5265320Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") [... the identical W0507 "Encountered an exception in identify_mutated_tensors, assuming every input is mutated" warning, with the same _fbgemm_silu_mul_quant CompilationError traceback, was emitted three more times at 20:32:06.709325, 20:32:07.219366, and 20:32:07.259408; duplicates elided ...]
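[note: root cause of the repeated failures above. Triton's fp8e4nv is the NVIDIA E4M3 FP8 format, which Triton only lowers on GPUs of compute capability 8.9 or newer (Ada/Hopper). This job runs on linux.g5.4xlarge, whose NVIDIA A10G is SM 8.6 (Ampere); there Triton exposes only fp8e4b15 and fp8e5, so every kernel that casts to tl.float8e4nv fails during ast_to_ttir, before autotuning even starts. Below is a minimal sketch of a capability guard that would skip these cases on unsupported hardware; the helper name supports_fp8e4nv and its placement are assumptions, not part of the test suite.]

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        """True when Triton can lower fp8e4nv (FP8 E4M3) on this GPU."""
        if not torch.cuda.is_available():
            return False
        # Triton requires SM 8.9+ for E4M3; the A10G on g5 runners is SM 8.6.
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 needs SM 8.9+")
    # def test_silu_mul_quant(self, ...): ...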
2025-05-07T20:32:07.7155715Z self = 2025-05-07T20:32:07.7156311Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:07.7156723Z 2025-05-07T20:32:07.7156873Z @given( 2025-05-07T20:32:07.7157185Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.7157598Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.7158314Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.7158666Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.7159013Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.7159307Z ) 2025-05-07T20:32:07.7159677Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.7160152Z def test_silu_mul_quant( 2025-05-07T20:32:07.7160398Z self, 2025-05-07T20:32:07.7160596Z T: int, 2025-05-07T20:32:07.7160795Z D: int, 2025-05-07T20:32:07.7161010Z scale_ub: Optional[float], 2025-05-07T20:32:07.7161291Z contiguous: bool, 2025-05-07T20:32:07.7161533Z compiled: bool, 2025-05-07T20:32:07.7161766Z ) -> None: 2025-05-07T20:32:07.7161987Z torch.manual_seed(2025) 2025-05-07T20:32:07.7162241Z 2025-05-07T20:32:07.7162510Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.7162875Z 2025-05-07T20:32:07.7163077Z x_sign = torch.sign(x) 2025-05-07T20:32:07.7163375Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.7163689Z x = x_sign * x_clamp 2025-05-07T20:32:07.7163934Z x0 = x[:, :D] 2025-05-07T20:32:07.7164154Z x1 = x[:, D:] 2025-05-07T20:32:07.7164359Z 2025-05-07T20:32:07.7164544Z if contiguous: 2025-05-07T20:32:07.7164782Z x0 = x0.contiguous() 2025-05-07T20:32:07.7165039Z x1 = x1.contiguous() 2025-05-07T20:32:07.7165284Z 2025-05-07T20:32:07.7165480Z if scale_ub is not None: 2025-05-07T20:32:07.7165752Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.7166097Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.7166419Z ) 2025-05-07T20:32:07.7166768Z else: 2025-05-07T20:32:07.7166981Z scale_ub_tensor = None
2025-05-07T20:32:07.7167245Z 2025-05-07T20:32:07.7167481Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.7167818Z op = silu_mul_quant 2025-05-07T20:32:07.7168083Z if compiled: 2025-05-07T20:32:07.7168336Z op = torch.compile(op) 2025-05-07T20:32:07.7168651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.7168947Z 2025-05-07T20:32:07.7169150Z y_fp8, y_scale = fn() 2025-05-07T20:32:07.7169442Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:07.7169753Z 2025-05-07T20:32:07.7170004Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.7170348Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:07.7170656Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:07.7170989Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:07.7171367Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.7171704Z 2025-05-07T20:32:07.7171911Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:07.7172119Z 2025-05-07T20:32:07.7172236Z moe/activation_test.py:126: 2025-05-07T20:32:07.7172540Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.7172900Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:07.7173246Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.7174105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:07.7174934Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:07.7175525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.7176270Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.7177016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:07.7177918Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.7178786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:07.7179587Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.7180372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:07.7181061Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:07.7181705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:07.7182255Z fn() 2025-05-07T20:32:07.7183125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:07.7183759Z self.fn.run( 2025-05-07T20:32:07.7184255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.7184817Z kernel = self.compile( 2025-05-07T20:32:07.7185389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.7186087Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.7186495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.7186745Z 2025-05-07T20:32:07.7186962Z self = 2025-05-07T20:32:07.7188131Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.7189952Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7d2f1700>} 2025-05-07T20:32:07.7191422Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.7192522Z context = 2025-05-07T20:32:07.7192836Z 2025-05-07T20:32:07.7193007Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.7193557Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.7194048Z module_map=module_map) 2025-05-07T20:32:07.7194418Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.7194788Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:07.7195061Z E ^ 2025-05-07T20:32:07.7195550Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.7196042Z 2025-05-07T20:32:07.7196490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.7197044Z 2025-05-07T20:32:07.7197153Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.7197581Z self=, 2025-05-07T20:32:07.7197996Z T=16384, 2025-05-07T20:32:07.7198191Z D=5120, 2025-05-07T20:32:07.7198385Z scale_ub=None, 2025-05-07T20:32:07.7198589Z contiguous=True, 2025-05-07T20:32:07.7198808Z compiled=True, 2025-05-07T20:32:07.7199011Z ) 2025-05-07T20:32:07.7661385Z W0507 20:32:07.764492 86873 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:07.7663059Z W0507 20:32:07.764492 86873 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:07.7664534Z W0507 20:32:07.764492 86873 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:07.7665606Z W0507 20:32:07.764492 86873 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:07.7666812Z W0507 20:32:07.764492 86873 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
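[note: separate from the compile errors, the run also exhausts torch._dynamo's recompile budget here. fn() wraps silu_mul_quant in torch.compile for every Hypothesis example, and the contiguous sweep flips x0's row stride between D (5120, after .contiguous()) and 2*D (10240, as a view of x); each flip fails a stride guard and forces a recompile until config.recompile_limit (8) is hit, after which dynamo falls back to eager for that frame. A hedged sketch of the usual knobs follows; the values are illustrative, and dynamic=True reduces rather than guarantees zero recompiles.]

    import torch
    import torch._dynamo

    # Raise the per-frame recompile budget (the warning above shows the default, 8).
    torch._dynamo.config.recompile_limit = 32

    # Or compile once with dynamic shapes so one graph can serve both layouts:
    # op = torch.compile(silu_mul_quant, dynamic=True)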
2025-05-07T20:32:07.8889628Z self = 2025-05-07T20:32:07.8890161Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True [... test source listing and ref_fn() traceback identical to the T=4096 failure above: triton_quantize_fp8_row -> _kernel_quantize_fp8_row fails in src.make_ir with the same fp8e4nv CompilationError (triton/compiler/compiler.py:100); duplicates elided ...]
2025-05-07T20:32:07.8929975Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.8930395Z self=, 2025-05-07T20:32:07.8930812Z T=1, 2025-05-07T20:32:07.8930990Z D=5120, 2025-05-07T20:32:07.8931186Z scale_ub=1200.0, 2025-05-07T20:32:07.8931404Z contiguous=True, 2025-05-07T20:32:07.8931626Z compiled=True, 2025-05-07T20:32:07.8931832Z ) 2025-05-07T20:32:08.2742564Z self = 2025-05-07T20:32:08.2743485Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True [... test source listing identical to the T=4096 example above; elided ...] 2025-05-07T20:32:08.2763171Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.2763618Z moe/activation_test.py:117: 2025-05-07T20:32:08.2764121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.2764694Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.2765187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.2766149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.2767905Z return fn(*args, **kwargs) 2025-05-07T20:32:08.2769099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.2770280Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.2771149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.2772296Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.2773464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.2774418Z kernel = self.compile( 2025-05-07T20:32:08.2775368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.2776537Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.2777238Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ [... locals (self, options=CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, ...), codegen_fns, module_map, context) elided ...] 2025-05-07T20:32:08.2788046Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.2789015Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.2789951Z module_map=module_map) 2025-05-07T20:32:08.2790558Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.2791137Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.2791566Z E ^ 2025-05-07T20:32:08.2792366Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.2793929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
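[note: both failure paths bottom out in the same frame. The reference path (ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row) and the op under test (fn -> silu_mul_quant -> _fbgemm_silu_mul_quant) each raise inside src.make_ir, i.e. while translating the kernel's Python AST to TTIR, so the error is independent of T, D, scale_ub, contiguity, and torch.compile. A minimal repro sketch of the same error on a pre-SM-8.9 GPU; the kernel and tensor names are illustrative, not taken from FBGEMM.]

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _fp8_cast_repro(x_ptr, y_ptr, N: tl.constexpr):
        offs = tl.arange(0, N)
        x = tl.load(x_ptr + offs)
        # The cast below is what trips "type fp8e4nv not supported in this
        # architecture" at compile time on SM < 8.9 (e.g. the A10G).
        tl.store(y_ptr + offs, x.to(tl.float8e4nv))

    x = torch.randn(16, device="cuda")
    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    _fp8_cast_repro[(1,)](x, y, N=16)  # raises triton.compiler.errors.CompilationError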
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.2793187Z 2025-05-07T20:32:08.2793929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.2794872Z 2025-05-07T20:32:08.2795045Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.2795758Z self=, 2025-05-07T20:32:08.2796447Z T=1, 2025-05-07T20:32:08.2796760Z D=5120, 2025-05-07T20:32:08.2797073Z scale_ub=None, 2025-05-07T20:32:08.2797426Z contiguous=False, 2025-05-07T20:32:08.2797803Z compiled=True, 2025-05-07T20:32:08.2798137Z ) 2025-05-07T20:32:08.3609195Z self = 2025-05-07T20:32:08.3610081Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:08.3610521Z 2025-05-07T20:32:08.3610648Z @given( 2025-05-07T20:32:08.3611017Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.3611544Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.3612032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.3612608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.3613151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.3613530Z ) 2025-05-07T20:32:08.3614325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.3615030Z def test_silu_mul_quant( 2025-05-07T20:32:08.3615395Z self, 2025-05-07T20:32:08.3615701Z T: int, 2025-05-07T20:32:08.3616007Z D: int, 2025-05-07T20:32:08.3616356Z scale_ub: Optional[float], 2025-05-07T20:32:08.3616775Z contiguous: bool, 2025-05-07T20:32:08.3617157Z compiled: bool, 2025-05-07T20:32:08.3617529Z ) -> None: 2025-05-07T20:32:08.3617882Z torch.manual_seed(2025) 2025-05-07T20:32:08.3618289Z 2025-05-07T20:32:08.3618748Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.3619321Z 2025-05-07T20:32:08.3619642Z x_sign = torch.sign(x) 2025-05-07T20:32:08.3620135Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.3620667Z x = x_sign * x_clamp 2025-05-07T20:32:08.3621073Z x0 = x[:, :D] 2025-05-07T20:32:08.3621435Z x1 = x[:, D:] 2025-05-07T20:32:08.3621780Z 2025-05-07T20:32:08.3622086Z if contiguous: 2025-05-07T20:32:08.3634089Z x0 = x0.contiguous() 2025-05-07T20:32:08.3634532Z x1 = x1.contiguous() 2025-05-07T20:32:08.3634928Z 2025-05-07T20:32:08.3635240Z if scale_ub is not None: 2025-05-07T20:32:08.3635690Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.3636271Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.3636793Z ) 2025-05-07T20:32:08.3637118Z else: 2025-05-07T20:32:08.3637460Z scale_ub_tensor = None 2025-05-07T20:32:08.3637886Z 2025-05-07T20:32:08.3638275Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.3638811Z op = silu_mul_quant 2025-05-07T20:32:08.3639520Z if compiled: 2025-05-07T20:32:08.3639943Z op = torch.compile(op) 2025-05-07T20:32:08.3640436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.3640910Z 2025-05-07T20:32:08.3641240Z y_fp8, y_scale = fn() 2025-05-07T20:32:08.3641713Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:08.3642212Z 2025-05-07T20:32:08.3642606Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.3643178Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:08.3643683Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:08.3644225Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:08.3644841Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.3645367Z 2025-05-07T20:32:08.3645701Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:08.3646044Z 2025-05-07T20:32:08.3646218Z moe/activation_test.py:126: 2025-05-07T20:32:08.3646733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.3647312Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:08.3647876Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.3649316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:08.3650667Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:08.3651617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.3652784Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.3654005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:08.3655302Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.3656650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:08.3658144Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.3659502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:08.3660642Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:08.3661707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:08.3662625Z fn() 2025-05-07T20:32:08.3663525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:08.3664567Z self.fn.run( 2025-05-07T20:32:08.3665378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.3666327Z kernel = self.compile( 2025-05-07T20:32:08.3667272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.3668434Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.3669118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.3669510Z 2025-05-07T20:32:08.3669957Z self = 2025-05-07T20:32:08.3671867Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.3674341Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1d7cdb9af0>} 2025-05-07T20:32:08.3676698Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.3678634Z context = 2025-05-07T20:32:08.3679150Z 2025-05-07T20:32:08.3679425Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.3680312Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.3681118Z module_map=module_map) 2025-05-07T20:32:08.3681710Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.3682299Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:08.3682741Z E ^ 2025-05-07T20:32:08.3683983Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.3684793Z 2025-05-07T20:32:08.3685535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.3686454Z 2025-05-07T20:32:08.3686622Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.3687331Z self=, 2025-05-07T20:32:08.3688015Z T=1, 2025-05-07T20:32:08.3688315Z D=5120, 2025-05-07T20:32:08.3688628Z scale_ub=None, 2025-05-07T20:32:08.3688964Z contiguous=True, 2025-05-07T20:32:08.3689325Z compiled=False, 2025-05-07T20:32:08.3689655Z ) 2025-05-07T20:32:08.5647655Z self = 2025-05-07T20:32:08.5648558Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:08.5649003Z 2025-05-07T20:32:08.5649129Z @given( 2025-05-07T20:32:08.5649501Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.5650008Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.5650530Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.5651069Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.5651914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.5652318Z ) 2025-05-07T20:32:08.5652829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.5653522Z def test_silu_mul_quant( 2025-05-07T20:32:08.5653890Z self, 2025-05-07T20:32:08.5654191Z T: int, 2025-05-07T20:32:08.5654502Z D: int, 2025-05-07T20:32:08.5654835Z scale_ub: Optional[float], 2025-05-07T20:32:08.5655253Z contiguous: bool, 2025-05-07T20:32:08.5655640Z compiled: bool, 2025-05-07T20:32:08.5656003Z ) -> None: 2025-05-07T20:32:08.5656364Z torch.manual_seed(2025) 2025-05-07T20:32:08.5656762Z 2025-05-07T20:32:08.5657181Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.5657750Z 2025-05-07T20:32:08.5658057Z x_sign = torch.sign(x) 2025-05-07T20:32:08.5658520Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.5659036Z x = x_sign * x_clamp 2025-05-07T20:32:08.5659432Z x0 = x[:, :D] 2025-05-07T20:32:08.5659765Z x1 = x[:, D:] 2025-05-07T20:32:08.5660099Z 2025-05-07T20:32:08.5660394Z if contiguous: 2025-05-07T20:32:08.5660756Z x0 = x0.contiguous() 2025-05-07T20:32:08.5661180Z x1 = x1.contiguous() 2025-05-07T20:32:08.5661570Z 2025-05-07T20:32:08.5661876Z if scale_ub is not None: 2025-05-07T20:32:08.5662319Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.5662865Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.5663368Z ) 2025-05-07T20:32:08.5663670Z else: 2025-05-07T20:32:08.5664010Z scale_ub_tensor = None 2025-05-07T20:32:08.5664420Z 2025-05-07T20:32:08.5664779Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.5665559Z op = silu_mul_quant 2025-05-07T20:32:08.5665964Z if compiled: 2025-05-07T20:32:08.5666356Z op 
= torch.compile(op) 2025-05-07T20:32:08.5666847Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5667301Z 2025-05-07T20:32:08.5667599Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.5667876Z 2025-05-07T20:32:08.5668037Z moe/activation_test.py:117: 2025-05-07T20:32:08.5668525Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5669125Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.5669579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5670935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.5672144Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.5673051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.5674253Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.5675416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.5676330Z kernel = self.compile( 2025-05-07T20:32:08.5677255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.5678382Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.5679060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5679459Z 2025-05-07T20:32:08.5679799Z self = 2025-05-07T20:32:08.5681648Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.5684596Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7d2f18b0>} 2025-05-07T20:32:08.5686964Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.5688775Z context = 2025-05-07T20:32:08.5689278Z 2025-05-07T20:32:08.5689550Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.5690448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.5691258Z module_map=module_map) 2025-05-07T20:32:08.5691859Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.5692432Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.5692846Z E ^ 2025-05-07T20:32:08.5693645Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.5694447Z 2025-05-07T20:32:08.5695181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.5696103Z 2025-05-07T20:32:08.5696272Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.5696968Z self=, 2025-05-07T20:32:08.5697649Z T=128, 2025-05-07T20:32:08.5697937Z D=5120, 2025-05-07T20:32:08.5698242Z scale_ub=None, 2025-05-07T20:32:08.5698588Z contiguous=False, 2025-05-07T20:32:08.5698939Z compiled=True, 2025-05-07T20:32:08.5699266Z ) 2025-05-07T20:32:08.5699821Z self = 2025-05-07T20:32:08.5700852Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:08.5701323Z 2025-05-07T20:32:08.5701447Z @given( 2025-05-07T20:32:08.5701826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.5702344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.5702845Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.5703395Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.5703940Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.5704407Z ) 2025-05-07T20:32:08.5704994Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.5705729Z def test_silu_mul_quant( 2025-05-07T20:32:08.5706107Z self, 2025-05-07T20:32:08.5706417Z T: int, 2025-05-07T20:32:08.5706730Z D: int, 2025-05-07T20:32:08.5707070Z scale_ub: Optional[float], 2025-05-07T20:32:08.5707510Z contiguous: bool, 2025-05-07T20:32:08.5707913Z compiled: bool, 2025-05-07T20:32:08.5708259Z ) -> None: 2025-05-07T20:32:08.5708619Z torch.manual_seed(2025) 2025-05-07T20:32:08.5709012Z 2025-05-07T20:32:08.5709499Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.5710213Z 2025-05-07T20:32:08.5710513Z x_sign = torch.sign(x) 2025-05-07T20:32:08.5710987Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.5711507Z x = x_sign * x_clamp 2025-05-07T20:32:08.5711889Z x0 = x[:, :D] 2025-05-07T20:32:08.5712226Z x1 = x[:, D:] 2025-05-07T20:32:08.5712564Z 2025-05-07T20:32:08.5712848Z if contiguous: 2025-05-07T20:32:08.5713219Z x0 = x0.contiguous() 2025-05-07T20:32:08.5713640Z x1 = x1.contiguous() 2025-05-07T20:32:08.5714032Z 2025-05-07T20:32:08.5714331Z if scale_ub is not None: 2025-05-07T20:32:08.5714781Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.5715340Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.5715844Z ) 2025-05-07T20:32:08.5716151Z else: 2025-05-07T20:32:08.5716653Z scale_ub_tensor = None 2025-05-07T20:32:08.5717059Z 2025-05-07T20:32:08.5717426Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.5717946Z op = silu_mul_quant 2025-05-07T20:32:08.5718340Z if compiled: 2025-05-07T20:32:08.5718738Z op = torch.compile(op) 2025-05-07T20:32:08.5719219Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5719656Z 2025-05-07T20:32:08.5719955Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.5720224Z 2025-05-07T20:32:08.5720384Z moe/activation_test.py:117: 2025-05-07T20:32:08.5720865Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5721409Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.5721870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5722831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.5723799Z return fn(*args, **kwargs) 
2025-05-07T20:32:08.5724930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:08.5726043Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:08.5726947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:08.5728130Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:08.5729294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:08.5730220Z     kernel = self.compile(
2025-05-07T20:32:08.5731143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.5732433Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.5733111Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.5743072Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:08.5743962Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.5744764Z                            module_map=module_map)
2025-05-07T20:32:08.5745361Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.5745933Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.5746368Z E   ^
2025-05-07T20:32:08.5747162Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.5748750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.5749934Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:08.7300304Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:08.7300746Z moe/activation_test.py:117:
2025-05-07T20:32:08.7301239Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.7301784Z moe/activation_test.py:115: in fn
2025-05-07T20:32:08.7302247Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:08.7303447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:08.7304649Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:08.7305574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:08.7306773Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:08.7307939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:08.7308919Z     kernel = self.compile(
2025-05-07T20:32:08.7310230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.7311393Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.7312056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.7322073Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:08.7322971Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.7323778Z                            module_map=module_map)
2025-05-07T20:32:08.7324354Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.7324929Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.7325352Z E   ^
2025-05-07T20:32:08.7326291Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.7327829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
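Every failure in this job reduces to the same root cause: Triton lowers torch.float8_e4m3fn to its fp8e4nv type, and this Triton build only compiles fp8e4nv for NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper class). The linux.g5.4xlarge runner carries an A10G, which reports compute capability (8, 6), so AST-to-TTIR conversion rejects the kernel with the ValueError shown above. A minimal capability guard would look roughly like the sketch below (illustrative only; fp8e4nv_supported is not an FBGEMM function):

import torch

def fp8e4nv_supported() -> bool:
    # Triton's fp8e4nv is torch.float8_e4m3fn; in this Triton version it only
    # lowers on NVIDIA GPUs with compute capability >= (8, 9).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# e.g. with unittest: @unittest.skipIf(not fp8e4nv_supported(), "needs SM 8.9+")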
2025-05-07T20:32:08.7328901Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:08.7352314Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:08.7352750Z moe/activation_test.py:117:
2025-05-07T20:32:08.7388594Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.7392244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.7393323Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:08.9715361Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:08.9715633Z moe/activation_test.py:117:
2025-05-07T20:32:08.9730234Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.9732291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
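The compiled=True examples that follow fail identically to the eager ones; the only difference in their tracebacks is an extra torch/_dynamo/eval_frame.py frame. torch.compile merely wraps the Python call, while the Triton kernel inside silu_mul_quant is still JIT-compiled at first launch, so the same fp8e4nv rejection fires either way. Roughly (the import path is inferred from the traceback above, so treat it as an assumption):

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Wrapping in torch.compile only adds a dynamo frame; the underlying
# _fbgemm_silu_mul_quant Triton kernel is compiled at launch in both cases.
op = torch.compile(silu_mul_quant)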
2025-05-07T20:32:08.9732966Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:08.9747751Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:08.9748027Z moe/activation_test.py:117:
2025-05-07T20:32:08.9763677Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.9765746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.9766409Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:09.3585600Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:09.3585873Z moe/activation_test.py:117:
2025-05-07T20:32:09.3601608Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.3603666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
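Any of these Hypothesis examples can be replayed outside the test harness by pasting the reported arguments into a direct call; a sketch for the T=1, D=7168, scale_ub=1200.0 case follows (same caveat about the import path; on this runner the call raises the CompilationError rather than returning):

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 1, 7168  # values taken from the failing example above
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
scale_ub_tensor = torch.tensor([1200.0], device="cuda", dtype=torch.float32)
# Raises triton.compiler.errors.CompilationError on SM 8.6 GPUs:
y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], scale_ub_tensor)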
2025-05-07T20:32:09.3604336Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:09.4771980Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:09.4784676Z         y_fp8, y_scale = fn()
2025-05-07T20:32:09.4784969Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:09.4785511Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:09.4785858Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:09.4786162Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:09.4786488Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:09.4786859Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:09.4787384Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:09.4787693Z moe/activation_test.py:126:
2025-05-07T20:32:09.4787998Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:09.4788356Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:09.4788698Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:09.4789878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:09.4790722Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:09.4792823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:09.4793618Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:09.4796071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:09.4796770Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:09.4797424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:09.4797985Z     fn()
2025-05-07T20:32:09.4799659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:09.4800235Z     kernel = self.compile(
2025-05-07T20:32:09.4800938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:09.4801660Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:09.4809910Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.4810284Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:09.4810560Z E   ^
2025-05-07T20:32:09.4812270Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.4813239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
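This example is the one variant worth reading in full: here the recorded failure is not in fn() but in the test's reference path, because ref_fn calls triton_quantize_fp8_row and its _kernel_quantize_fp8_row kernel trips the same fp8e4nv limitation. A reference that avoids Triton entirely can be written in plain PyTorch; the sketch below shows the general rowwise-quantization technique (my own approximation, not fbgemm_gpu's triton_quantize_fp8_row, and it still requires a PyTorch build with float8 dtypes):

from typing import Optional, Tuple
import torch

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor]
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Rowwise scheme: scale = row_max / fp8_max, quantize as y / scale,
    # dequantize later as y_fp8.float() * scale[:, None].
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    scale = torch.clamp(row_max, min=1e-12) / fp8_max
    y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale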
2025-05-07T20:32:09.6848587Z op = torch.compile(op) 2025-05-07T20:32:09.6848896Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.6849186Z 2025-05-07T20:32:09.6849384Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.6849555Z 2025-05-07T20:32:09.6849655Z moe/activation_test.py:117: 2025-05-07T20:32:09.6849965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.6850314Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.6850597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.6851195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.6851801Z return fn(*args, **kwargs) 2025-05-07T20:32:09.6852518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.6853265Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.6853983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.6854723Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.6855437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.6856006Z kernel = self.compile( 2025-05-07T20:32:09.6856586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.6857289Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.6857702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.6857956Z 2025-05-07T20:32:09.6858171Z self = 2025-05-07T20:32:09.6859357Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.6860882Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7c188e50>} 2025-05-07T20:32:09.6862356Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.6863462Z context = 2025-05-07T20:32:09.6863775Z 2025-05-07T20:32:09.6863946Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.6864498Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.6865084Z module_map=module_map) 2025-05-07T20:32:09.6865460Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.6865835Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.6866106Z E ^ 2025-05-07T20:32:09.6866593Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.6867091Z 2025-05-07T20:32:09.6867541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.6868104Z 2025-05-07T20:32:09.6868214Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.6868649Z self=, 2025-05-07T20:32:09.6869120Z T=1, 2025-05-07T20:32:09.6869312Z D=5120, 2025-05-07T20:32:09.6869510Z scale_ub=1200.0, 2025-05-07T20:32:09.6869979Z contiguous=False, 2025-05-07T20:32:09.6870208Z compiled=False, 2025-05-07T20:32:09.6870425Z ) 2025-05-07T20:32:09.6870754Z self = 2025-05-07T20:32:09.6871274Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:09.6871567Z 2025-05-07T20:32:09.6871647Z @given( 2025-05-07T20:32:09.6871881Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.6872200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.6872518Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.6872861Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.6873205Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.6873498Z ) 2025-05-07T20:32:09.6873863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.6874336Z def test_silu_mul_quant( 2025-05-07T20:32:09.6874579Z self, 2025-05-07T20:32:09.6874788Z T: int, 2025-05-07T20:32:09.6874994Z D: int, 2025-05-07T20:32:09.6875208Z scale_ub: Optional[float], 2025-05-07T20:32:09.6875492Z contiguous: bool, 2025-05-07T20:32:09.6875833Z compiled: bool, 2025-05-07T20:32:09.6876059Z ) -> None: 2025-05-07T20:32:09.6876280Z torch.manual_seed(2025) 2025-05-07T20:32:09.6876535Z 2025-05-07T20:32:09.6876805Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.6877172Z 2025-05-07T20:32:09.6877368Z x_sign = torch.sign(x) 2025-05-07T20:32:09.6877664Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.6877987Z x = x_sign * x_clamp 2025-05-07T20:32:09.6878231Z x0 = x[:, :D] 2025-05-07T20:32:09.6878458Z x1 = x[:, D:] 2025-05-07T20:32:09.6878667Z 2025-05-07T20:32:09.6878858Z if contiguous: 2025-05-07T20:32:09.6879095Z x0 = x0.contiguous() 2025-05-07T20:32:09.6879355Z x1 = x1.contiguous() 2025-05-07T20:32:09.6879608Z 2025-05-07T20:32:09.6879807Z if scale_ub is not None: 2025-05-07T20:32:09.6880085Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.6880442Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.6880775Z ) 2025-05-07T20:32:09.6880965Z else: 2025-05-07T20:32:09.6881184Z scale_ub_tensor = None 2025-05-07T20:32:09.6881449Z 2025-05-07T20:32:09.6881678Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.6882014Z op = silu_mul_quant 2025-05-07T20:32:09.6882275Z if compiled: 2025-05-07T20:32:09.6882525Z op = torch.compile(op) 2025-05-07T20:32:09.6883126Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.6883415Z 2025-05-07T20:32:09.6883617Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.6883784Z 2025-05-07T20:32:09.6883880Z moe/activation_test.py:117: 2025-05-07T20:32:09.6884315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.6884659Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.6884942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.6885685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.6886428Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.6886995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.6887723Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.6888447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.6889050Z kernel = self.compile( 2025-05-07T20:32:09.6889616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.6890323Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.6890735Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.6890981Z 2025-05-07T20:32:09.6891199Z self = 2025-05-07T20:32:09.6892361Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.6893866Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cefb79820>} 2025-05-07T20:32:09.6895336Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.6896452Z context = 2025-05-07T20:32:09.6896756Z 2025-05-07T20:32:09.6897050Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.6897596Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.6898088Z module_map=module_map) 2025-05-07T20:32:09.6898467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.6898825Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.6899089Z E ^ 2025-05-07T20:32:09.6899584Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.6900072Z 2025-05-07T20:32:09.6900526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.6901086Z 2025-05-07T20:32:09.6901187Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.6901620Z self=, 2025-05-07T20:32:09.6902042Z T=16384, 2025-05-07T20:32:09.6902241Z D=5120, 2025-05-07T20:32:09.6902439Z scale_ub=1200.0, 2025-05-07T20:32:09.6902662Z contiguous=False, 2025-05-07T20:32:09.6902883Z compiled=True, 2025-05-07T20:32:09.6903091Z ) 2025-05-07T20:32:09.8110100Z self = 2025-05-07T20:32:09.8110860Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:09.8111281Z 2025-05-07T20:32:09.8111394Z @given( 2025-05-07T20:32:09.8111715Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.8112160Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.8112588Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.8113047Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.8113829Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.8114204Z ) 2025-05-07T20:32:09.8114663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.8115131Z def test_silu_mul_quant( 2025-05-07T20:32:09.8115379Z self, 2025-05-07T20:32:09.8115567Z T: int, 2025-05-07T20:32:09.8115767Z D: int, 2025-05-07T20:32:09.8115984Z scale_ub: Optional[float], 2025-05-07T20:32:09.8116254Z contiguous: bool, 2025-05-07T20:32:09.8116502Z compiled: bool, 2025-05-07T20:32:09.8116749Z ) -> None: 2025-05-07T20:32:09.8116970Z torch.manual_seed(2025) 2025-05-07T20:32:09.8117222Z 2025-05-07T20:32:09.8117498Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.8117897Z 2025-05-07T20:32:09.8118088Z x_sign = torch.sign(x) 2025-05-07T20:32:09.8118390Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.8118722Z x = x_sign * x_clamp 2025-05-07T20:32:09.8118966Z x0 = x[:, :D] 2025-05-07T20:32:09.8119190Z x1 = x[:, D:] 2025-05-07T20:32:09.8119409Z 2025-05-07T20:32:09.8119598Z if contiguous: 2025-05-07T20:32:09.8119838Z x0 = x0.contiguous() 2025-05-07T20:32:09.8120110Z x1 = x1.contiguous() 2025-05-07T20:32:09.8120363Z 2025-05-07T20:32:09.8120557Z if scale_ub is not None: 2025-05-07T20:32:09.8120846Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.8121193Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.8121507Z ) 2025-05-07T20:32:09.8121697Z else: 2025-05-07T20:32:09.8121909Z scale_ub_tensor = None 2025-05-07T20:32:09.8122160Z 2025-05-07T20:32:09.8122390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.8122714Z op = silu_mul_quant 2025-05-07T20:32:09.8122962Z if compiled: 2025-05-07T20:32:09.8123215Z op = torch.compile(op) 2025-05-07T20:32:09.8123521Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8123799Z 2025-05-07T20:32:09.8123993Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.8124339Z 2025-05-07T20:32:09.8124447Z moe/activation_test.py:117: 2025-05-07T20:32:09.8124747Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8125088Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.8125373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8125965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.8126560Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.8127267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.8128014Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.8128584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.8129310Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.8130021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.8130587Z kernel = self.compile( 2025-05-07T20:32:09.8131154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.8131861Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.8132278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8132519Z 2025-05-07T20:32:09.8132739Z self = 2025-05-07T20:32:09.8133900Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.8135513Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7c137790>} 2025-05-07T20:32:09.8136981Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.8138092Z context = 2025-05-07T20:32:09.8138397Z 2025-05-07T20:32:09.8138574Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.8139116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.8139611Z module_map=module_map) 2025-05-07T20:32:09.8139996Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.8140353Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.8140632Z E ^ 2025-05-07T20:32:09.8141130Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.8141624Z 2025-05-07T20:32:09.8142083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.8142640Z 2025-05-07T20:32:09.8142744Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.8143180Z self=, 2025-05-07T20:32:09.8143613Z T=2048, 2025-05-07T20:32:09.8143795Z D=7168, 2025-05-07T20:32:09.8143988Z scale_ub=1200.0, 2025-05-07T20:32:09.8144213Z contiguous=False, 2025-05-07T20:32:09.8144433Z compiled=True, 2025-05-07T20:32:09.8144640Z ) 2025-05-07T20:32:09.8144967Z self = 2025-05-07T20:32:09.8145492Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:09.8145790Z 2025-05-07T20:32:09.8145949Z @given( 2025-05-07T20:32:09.8146184Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.8146509Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.8146821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.8147169Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.8147514Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.8147804Z ) 2025-05-07T20:32:09.8148169Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.8148639Z def test_silu_mul_quant( 2025-05-07T20:32:09.8148874Z self, 2025-05-07T20:32:09.8149064Z T: int, 2025-05-07T20:32:09.8149262Z D: int, 2025-05-07T20:32:09.8149481Z scale_ub: Optional[float], 2025-05-07T20:32:09.8149875Z contiguous: bool, 2025-05-07T20:32:09.8150120Z compiled: bool, 2025-05-07T20:32:09.8150350Z ) -> None: 2025-05-07T20:32:09.8150557Z torch.manual_seed(2025) 2025-05-07T20:32:09.8150803Z 2025-05-07T20:32:09.8151078Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.8151429Z 2025-05-07T20:32:09.8151618Z x_sign = torch.sign(x) 2025-05-07T20:32:09.8151915Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.8152227Z x = x_sign * x_clamp 2025-05-07T20:32:09.8152478Z x0 = x[:, :D] 2025-05-07T20:32:09.8152703Z x1 = x[:, D:] 2025-05-07T20:32:09.8152914Z 2025-05-07T20:32:09.8153107Z if contiguous: 2025-05-07T20:32:09.8153347Z x0 = x0.contiguous() 2025-05-07T20:32:09.8153610Z x1 = x1.contiguous() 2025-05-07T20:32:09.8153868Z 2025-05-07T20:32:09.8154069Z if scale_ub is not None: 2025-05-07T20:32:09.8154430Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.8154775Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.8155100Z ) 2025-05-07T20:32:09.8155298Z else: 2025-05-07T20:32:09.8155508Z scale_ub_tensor = None 2025-05-07T20:32:09.8155768Z 2025-05-07T20:32:09.8156002Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.8156324Z op = silu_mul_quant 2025-05-07T20:32:09.8156579Z if compiled: 2025-05-07T20:32:09.8156829Z op = torch.compile(op) 2025-05-07T20:32:09.8157126Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8157416Z 2025-05-07T20:32:09.8157618Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.8157794Z 2025-05-07T20:32:09.8157896Z moe/activation_test.py:117: 2025-05-07T20:32:09.8158206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8158564Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.8158860Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8159446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.8160045Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.8160749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:09.8161483Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:09.8162056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:09.8162786Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:09.8163492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:09.8164054Z     kernel = self.compile(
2025-05-07T20:32:09.8164628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:09.8165334Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:09.8165831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:09.8166080Z 
2025-05-07T20:32:09.8166292Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:09.8167458Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:09.8168966Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f1cefadf4c0>}
2025-05-07T20:32:09.8170440Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:09.8171549Z context = <...>
2025-05-07T20:32:09.8171867Z 
2025-05-07T20:32:09.8172035Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:09.8172582Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:09.8173071Z                            module_map=module_map)
2025-05-07T20:32:09.8173439Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.8173799Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.8174066Z E       ^
2025-05-07T20:32:09.8174551Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.8175042Z 
2025-05-07T20:32:09.8175489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.8176132Z 
2025-05-07T20:32:10.0873473Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:10.0874040Z     self=<...>,
2025-05-07T20:32:10.0874631Z     T=1,
2025-05-07T20:32:10.0874879Z     D=5120,
2025-05-07T20:32:10.0875082Z     scale_ub=None,
2025-05-07T20:32:10.0875307Z     contiguous=False,
2025-05-07T20:32:10.0875546Z     compiled=False,
2025-05-07T20:32:10.0875765Z )
2025-05-07T20:32:10.0876096Z self = <...>
2025-05-07T20:32:10.0876619Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:10.0876902Z 
2025-05-07T20:32:10.0876989Z     @given(
2025-05-07T20:32:10.0877224Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:10.0877555Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:10.0877880Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:10.0878240Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:10.0878584Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:10.0878897Z     )
2025-05-07T20:32:10.0879273Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:10.0879744Z     def test_silu_mul_quant(
2025-05-07T20:32:10.0880001Z         self,
2025-05-07T20:32:10.0880207Z         T: int,
2025-05-07T20:32:10.0880408Z         D: int,
2025-05-07T20:32:10.0880638Z         scale_ub: Optional[float],
2025-05-07T20:32:10.0880926Z         contiguous: bool,
2025-05-07T20:32:10.0881173Z         compiled: bool,
2025-05-07T20:32:10.0881417Z     ) -> None:
2025-05-07T20:32:10.0881642Z         torch.manual_seed(2025)
2025-05-07T20:32:10.0881896Z 
2025-05-07T20:32:10.0882180Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:10.0882554Z 
2025-05-07T20:32:10.0883057Z         x_sign = torch.sign(x)
2025-05-07T20:32:10.0883384Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:10.0883716Z         x = x_sign * x_clamp
2025-05-07T20:32:10.0883973Z         x0 = x[:, :D]
2025-05-07T20:32:10.0884198Z         x1 = x[:, D:]
2025-05-07T20:32:10.0884714Z 
2025-05-07T20:32:10.0884919Z         if contiguous:
2025-05-07T20:32:10.0885156Z             x0 = x0.contiguous()
2025-05-07T20:32:10.0885430Z             x1 = x1.contiguous()
2025-05-07T20:32:10.0893289Z 
2025-05-07T20:32:10.0893516Z         if scale_ub is not None:
2025-05-07T20:32:10.0893823Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:10.0894185Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:10.0894521Z             )
2025-05-07T20:32:10.0894727Z         else:
2025-05-07T20:32:10.0894946Z             scale_ub_tensor = None
2025-05-07T20:32:10.0895212Z 
2025-05-07T20:32:10.0895453Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:10.0895791Z             op = silu_mul_quant
2025-05-07T20:32:10.0896070Z             if compiled:
2025-05-07T20:32:10.0896328Z                 op = torch.compile(op)
2025-05-07T20:32:10.0896645Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:10.0896938Z 
2025-05-07T20:32:10.0897143Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:10.0897327Z 
2025-05-07T20:32:10.0897432Z moe/activation_test.py:117: 
2025-05-07T20:32:10.0897751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:10.0898105Z moe/activation_test.py:115: in fn
2025-05-07T20:32:10.0898408Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:10.0899163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:10.0899920Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:10.0912248Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.0912613Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:10.0912882Z E       ^
2025-05-07T20:32:10.0913374Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.0914403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:10.0914958Z 
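Every example in this sweep dies at the same point: fp8e4nv is Triton's name for the OCP float8 e4m3 format (torch.float8_e4m3fn), and Triton's NVIDIA backend only lowers it on GPUs with compute capability 8.9 or newer (Ada/Hopper). On older parts such as an A10G (SM 8.6) the backend exposes only fp8e4b15 and fp8e5, so _fbgemm_silu_mul_quant can never compile there, whatever the example parameters. A minimal sketch of a capability guard that would skip the FP8 path on such GPUs; the helper and class names are hypothetical and the guard is not part of activation_test.py:

    import unittest

    import torch

    def has_fp8e4nv() -> bool:
        # fp8e4nv (float8 e4m3) needs compute capability >= 8.9 (Ada/Hopper);
        # an A10G reports (8, 6), so the guard fires on this kind of runner.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):  # hypothetical stand-in for the real test class
        @unittest.skipIf(not has_fp8e4nv(), "fp8e4nv unsupported on this GPU; needs SM 8.9+")
        def test_silu_mul_quant(self) -> None:
            ...  # body as listed in the log above

    if __name__ == "__main__":
        unittest.main()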
2025-05-07T20:32:10.0915061Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) (same CompilationError)
2025-05-07T20:32:10.0947596Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:10.2153658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:10.2154258Z     return fn(*args, **kwargs)
2025-05-07T20:32:10.2167706Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.2168072Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:10.2168340Z E       ^
2025-05-07T20:32:10.2168827Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.2169771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:10.2170437Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) (same CompilationError)
2025-05-07T20:32:10.6223654Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) (same CompilationError)
2025-05-07T20:32:10.6267400Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) (same CompilationError)
2025-05-07T20:32:10.9048453Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) (same CompilationError)
2025-05-07T20:32:10.9081784Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) (same CompilationError)
2025-05-07T20:32:10.9114668Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) (same CompilationError)
2025-05-07T20:32:11.1064431Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) (same CompilationError)
2025-05-07T20:32:11.1104812Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) (same CompilationError)
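Hypothesis keeps drawing new (T, D, scale_ub, contiguous, compiled) combinations, but the outcome cannot change: the error is raised while the kernel is lowered to TTIR in ast_to_ttir, before any launch, so the input shapes and the torch.compile wrapper never come into play. A standalone sketch that reproduces the same ValueError on a pre-Ada GPU, assuming a recent Triton and a PyTorch build with float8 dtypes (kernel and variable names are illustrative):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8(x_ptr, y_ptr, BLOCK: tl.constexpr):
        offs = tl.arange(0, BLOCK)
        x = tl.load(x_ptr + offs)
        # On SM < 8.9 this cast is rejected during lowering with
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        y = x.to(tl.float8e4nv)
        tl.store(y_ptr + offs, y)

    x = torch.randn(128, device="cuda", dtype=torch.float32)
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    cast_to_fp8[(1,)](x, y, BLOCK=128)  # raises CompilationError on e.g. an A10G (SM 8.6)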
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.2319406Z 
2025-05-07T20:32:11.2319854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.2320408Z 
2025-05-07T20:32:11.2320518Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:11.2352713Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.4425159Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:11.4458506Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.4460123Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:11.4492051Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.8806082Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:11.8846931Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.8848548Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:12.0092626Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
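Every example above fails the same way, before the kernel body is ever lowered: Triton rejects the fp8e4nv destination type at compile time. fp8e4nv is Triton's name for the E4M3 format (torch.float8_e4m3fn), which Triton generally supports only on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper). The A10G in a linux.g5.4xlarge runner is SM 8.6, where only fp8e4b15 and fp8e5 are available, exactly as the error message lists. A minimal sketch of a capability guard for such tests, assuming the kernel quantizes to torch.float8_e4m3fn (supports_fp8e4nv is a hypothetical helper, not part of the FBGEMM test suite):

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # E4M3 (Triton fp8e4nv / torch.float8_e4m3fn) needs SM 8.9+;
    # an A10G reports (8, 6) and only offers fp8e4b15 / fp8e5 in Triton.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical use on the failing test class:
@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
class ActivationTests(unittest.TestCase):
    ...

With a guard like this, the test would be skipped once on SM 8.6 hardware instead of re-raising the identical CompilationError for every sampled (T, D, scale_ub, contiguous, compiled) combination.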
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.0093115Z 2025-05-07T20:32:12.0093572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.0094131Z 2025-05-07T20:32:12.0094235Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.0094671Z self=, 2025-05-07T20:32:12.0095100Z T=2048, 2025-05-07T20:32:12.0095286Z D=5120, 2025-05-07T20:32:12.0095484Z scale_ub=1200.0, 2025-05-07T20:32:12.0095717Z contiguous=False, 2025-05-07T20:32:12.0095942Z compiled=True, 2025-05-07T20:32:12.0096157Z ) 2025-05-07T20:32:12.0096489Z self = 2025-05-07T20:32:12.0097012Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.0097303Z 2025-05-07T20:32:12.0097382Z @given( 2025-05-07T20:32:12.0097618Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.0097961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.0098481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.0098830Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.0099172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.0099481Z ) 2025-05-07T20:32:12.0099837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.0100306Z def test_silu_mul_quant( 2025-05-07T20:32:12.0100556Z self, 2025-05-07T20:32:12.0100746Z T: int, 2025-05-07T20:32:12.0100950Z D: int, 2025-05-07T20:32:12.0101179Z scale_ub: Optional[float], 2025-05-07T20:32:12.0101454Z contiguous: bool, 2025-05-07T20:32:12.0101703Z compiled: bool, 2025-05-07T20:32:12.0101935Z ) -> None: 2025-05-07T20:32:12.0102149Z torch.manual_seed(2025) 2025-05-07T20:32:12.0102404Z 2025-05-07T20:32:12.0102690Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.0103043Z 2025-05-07T20:32:12.0103251Z x_sign = torch.sign(x) 2025-05-07T20:32:12.0103551Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.0103874Z x = x_sign * x_clamp 2025-05-07T20:32:12.0104127Z x0 = x[:, :D] 2025-05-07T20:32:12.0104351Z x1 = x[:, D:] 2025-05-07T20:32:12.0104562Z 2025-05-07T20:32:12.0104753Z if contiguous: 2025-05-07T20:32:12.0104991Z x0 = x0.contiguous() 2025-05-07T20:32:12.0105259Z x1 = x1.contiguous() 2025-05-07T20:32:12.0105504Z 2025-05-07T20:32:12.0105705Z if scale_ub is not None: 2025-05-07T20:32:12.0105985Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.0106326Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.0106650Z ) 2025-05-07T20:32:12.0106850Z else: 2025-05-07T20:32:12.0107059Z scale_ub_tensor = None 2025-05-07T20:32:12.0107321Z 2025-05-07T20:32:12.0107562Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.0107897Z op = silu_mul_quant 2025-05-07T20:32:12.0108156Z if compiled: 2025-05-07T20:32:12.0108409Z op = torch.compile(op) 2025-05-07T20:32:12.0108802Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.0109089Z 2025-05-07T20:32:12.0109288Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.0109460Z 2025-05-07T20:32:12.0109565Z moe/activation_test.py:117: 2025-05-07T20:32:12.0109965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.0110319Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.0110612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.0111198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.0111797Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.0112508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.0113261Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.0113827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.0114564Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.0115277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.0115842Z kernel = self.compile( 2025-05-07T20:32:12.0116424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.0117129Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.0117547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.0117788Z 2025-05-07T20:32:12.0118001Z self = 2025-05-07T20:32:12.0119753Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.0121252Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef654310>} 2025-05-07T20:32:12.0122714Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.0123824Z context = 2025-05-07T20:32:12.0124127Z 2025-05-07T20:32:12.0124299Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.0124852Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.0125349Z module_map=module_map) 2025-05-07T20:32:12.0125722Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.0126095Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.0126367Z E ^ 2025-05-07T20:32:12.0126862Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.0127353Z 2025-05-07T20:32:12.0127803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.0128364Z 2025-05-07T20:32:12.2383174Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.2383630Z self=, 2025-05-07T20:32:12.2384171Z T=4096, 2025-05-07T20:32:12.2384434Z D=5120, 2025-05-07T20:32:12.2384625Z scale_ub=1200.0, 2025-05-07T20:32:12.2384850Z contiguous=True, 2025-05-07T20:32:12.2385090Z compiled=True, 2025-05-07T20:32:12.2385292Z ) 2025-05-07T20:32:12.2385617Z self = 2025-05-07T20:32:12.2386431Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.2386723Z 2025-05-07T20:32:12.2386801Z @given( 2025-05-07T20:32:12.2387040Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.2387360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.2387668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.2388012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.2388351Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.2388650Z ) 2025-05-07T20:32:12.2389007Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.2389525Z def test_silu_mul_quant( 2025-05-07T20:32:12.2389888Z self, 2025-05-07T20:32:12.2390088Z T: int, 2025-05-07T20:32:12.2390287Z D: int, 2025-05-07T20:32:12.2390507Z scale_ub: Optional[float], 2025-05-07T20:32:12.2390778Z contiguous: bool, 2025-05-07T20:32:12.2391034Z compiled: bool, 2025-05-07T20:32:12.2391265Z ) -> None: 2025-05-07T20:32:12.2391480Z torch.manual_seed(2025) 2025-05-07T20:32:12.2391730Z 2025-05-07T20:32:12.2392009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.2392365Z 2025-05-07T20:32:12.2392563Z x_sign = torch.sign(x) 2025-05-07T20:32:12.2392863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.2393176Z x = x_sign * x_clamp 2025-05-07T20:32:12.2393421Z x0 = x[:, :D] 2025-05-07T20:32:12.2393639Z x1 = x[:, D:] 2025-05-07T20:32:12.2393853Z 2025-05-07T20:32:12.2394030Z if contiguous: 2025-05-07T20:32:12.2394262Z x0 = x0.contiguous() 2025-05-07T20:32:12.2394529Z x1 = x1.contiguous() 2025-05-07T20:32:12.2394931Z 2025-05-07T20:32:12.2395129Z if scale_ub is not None: 2025-05-07T20:32:12.2395408Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.2395749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.2396068Z ) 2025-05-07T20:32:12.2396255Z else: 2025-05-07T20:32:12.2396462Z scale_ub_tensor = None 2025-05-07T20:32:12.2396715Z 2025-05-07T20:32:12.2396949Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.2397264Z op = silu_mul_quant 2025-05-07T20:32:12.2397517Z if compiled: 2025-05-07T20:32:12.2397767Z op = torch.compile(op) 2025-05-07T20:32:12.2398068Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.2398350Z 2025-05-07T20:32:12.2398546Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.2398714Z 2025-05-07T20:32:12.2398816Z moe/activation_test.py:117: 2025-05-07T20:32:12.2399113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.2399465Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.2399758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.2400343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.2400939Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.2401643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.2402380Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.2402941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.2403669Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.2404375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.2404941Z kernel = self.compile( 2025-05-07T20:32:12.2405510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.2406296Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.2406712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.2406953Z 2025-05-07T20:32:12.2407166Z self = 2025-05-07T20:32:12.2408333Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.2409850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef73e040>} 2025-05-07T20:32:12.2411317Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.2412430Z context = 2025-05-07T20:32:12.2412734Z 2025-05-07T20:32:12.2412903Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.2413456Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.2413945Z module_map=module_map) 2025-05-07T20:32:12.2414314Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.2414676Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.2414941Z E ^ 2025-05-07T20:32:12.2415431Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.2416011Z 2025-05-07T20:32:12.2416458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.2417024Z 2025-05-07T20:32:12.2417132Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.2417561Z self=, 2025-05-07T20:32:12.2417978Z T=128, 2025-05-07T20:32:12.2418165Z D=5120, 2025-05-07T20:32:12.2418359Z scale_ub=1200.0, 2025-05-07T20:32:12.2418579Z contiguous=False, 2025-05-07T20:32:12.2418804Z compiled=True, 2025-05-07T20:32:12.2419007Z ) 2025-05-07T20:32:12.3764520Z self = 2025-05-07T20:32:12.3766032Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.3766813Z 2025-05-07T20:32:12.3767047Z @given( 2025-05-07T20:32:12.3767497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3768164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3768782Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3769216Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3769561Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3769857Z ) 2025-05-07T20:32:12.3770211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3770678Z def test_silu_mul_quant( 2025-05-07T20:32:12.3770927Z self, 2025-05-07T20:32:12.3771123Z T: int, 2025-05-07T20:32:12.3771314Z D: int, 2025-05-07T20:32:12.3771536Z scale_ub: Optional[float], 2025-05-07T20:32:12.3771811Z contiguous: bool, 2025-05-07T20:32:12.3772044Z compiled: bool, 2025-05-07T20:32:12.3772268Z ) -> None: 2025-05-07T20:32:12.3772481Z torch.manual_seed(2025) 2025-05-07T20:32:12.3772718Z 2025-05-07T20:32:12.3772992Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3773352Z 2025-05-07T20:32:12.3773539Z x_sign = torch.sign(x) 2025-05-07T20:32:12.3773828Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.3774433Z x = x_sign * x_clamp 2025-05-07T20:32:12.3774676Z x0 = x[:, :D] 2025-05-07T20:32:12.3774911Z x1 = x[:, D:] 2025-05-07T20:32:12.3775122Z 2025-05-07T20:32:12.3775303Z if contiguous: 2025-05-07T20:32:12.3775538Z x0 = x0.contiguous() 2025-05-07T20:32:12.3775801Z x1 = x1.contiguous() 2025-05-07T20:32:12.3784293Z 2025-05-07T20:32:12.3784544Z if scale_ub is not None: 2025-05-07T20:32:12.3784863Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.3785236Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.3785572Z ) 2025-05-07T20:32:12.3785784Z else: 2025-05-07T20:32:12.3786012Z scale_ub_tensor = None 2025-05-07T20:32:12.3786283Z 2025-05-07T20:32:12.3786546Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3786895Z op = silu_mul_quant 2025-05-07T20:32:12.3787162Z if compiled: 2025-05-07T20:32:12.3787438Z op = torch.compile(op) 2025-05-07T20:32:12.3787758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3788053Z 2025-05-07T20:32:12.3788263Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.3788440Z 2025-05-07T20:32:12.3788554Z moe/activation_test.py:117: 2025-05-07T20:32:12.3788878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3789238Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.3789537Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3790239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.3790849Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.3791572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.3792531Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.3793100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.3793835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.3794548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.3795116Z kernel = self.compile( 2025-05-07T20:32:12.3795683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.3796388Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.3796810Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3797052Z 2025-05-07T20:32:12.3797273Z self = 2025-05-07T20:32:12.3798450Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.3799962Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef73eca0>} 2025-05-07T20:32:12.3801429Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.3802542Z context = 2025-05-07T20:32:12.3802845Z 2025-05-07T20:32:12.3803020Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.3803570Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.3804057Z module_map=module_map) 2025-05-07T20:32:12.3804557Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.3804927Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.3805183Z E ^ 2025-05-07T20:32:12.3805676Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.3806162Z 2025-05-07T20:32:12.3806615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.3807169Z 2025-05-07T20:32:12.3807276Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.3807693Z self=, 2025-05-07T20:32:12.3808120Z T=16384, 2025-05-07T20:32:12.3808310Z D=7168, 2025-05-07T20:32:12.3808501Z scale_ub=1200.0, 2025-05-07T20:32:12.3808722Z contiguous=True, 2025-05-07T20:32:12.3808943Z compiled=True, 2025-05-07T20:32:12.3809140Z ) 2025-05-07T20:32:12.3809473Z self = 2025-05-07T20:32:12.3810000Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.3810293Z 2025-05-07T20:32:12.3810376Z @given( 2025-05-07T20:32:12.3810599Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3810921Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3811240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3811573Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3811911Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3812206Z ) 2025-05-07T20:32:12.3812560Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3813025Z def test_silu_mul_quant( 2025-05-07T20:32:12.3813354Z self, 2025-05-07T20:32:12.3813541Z T: int, 2025-05-07T20:32:12.3813740Z D: int, 2025-05-07T20:32:12.3813960Z scale_ub: Optional[float], 2025-05-07T20:32:12.3814234Z contiguous: bool, 2025-05-07T20:32:12.3814475Z compiled: bool, 2025-05-07T20:32:12.3814697Z ) -> None: 2025-05-07T20:32:12.3814916Z torch.manual_seed(2025) 2025-05-07T20:32:12.3815156Z 2025-05-07T20:32:12.3815432Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3815788Z 2025-05-07T20:32:12.3815974Z x_sign = torch.sign(x) 2025-05-07T20:32:12.3816271Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.3816592Z x = x_sign * x_clamp 2025-05-07T20:32:12.3816830Z x0 = x[:, :D] 2025-05-07T20:32:12.3817050Z x1 = x[:, D:] 2025-05-07T20:32:12.3817260Z 2025-05-07T20:32:12.3817441Z if contiguous: 2025-05-07T20:32:12.3817675Z x0 = x0.contiguous() 2025-05-07T20:32:12.3817945Z x1 = x1.contiguous() 2025-05-07T20:32:12.3818185Z 2025-05-07T20:32:12.3818377Z if scale_ub is not None: 2025-05-07T20:32:12.3818661Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.3818994Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.3819313Z ) 2025-05-07T20:32:12.3819510Z else: 2025-05-07T20:32:12.3819712Z scale_ub_tensor = None 2025-05-07T20:32:12.3819965Z 2025-05-07T20:32:12.3820187Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3820512Z op = silu_mul_quant 2025-05-07T20:32:12.3820759Z if compiled: 2025-05-07T20:32:12.3821006Z op = torch.compile(op) 2025-05-07T20:32:12.3821306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3821580Z 2025-05-07T20:32:12.3821766Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.3821928Z 2025-05-07T20:32:12.3822030Z moe/activation_test.py:117: 2025-05-07T20:32:12.3822329Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3822674Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.3823041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3823633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.3824224Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.3824933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.3825673Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.3826230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.3826960Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.3827670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.3828240Z kernel = self.compile( 2025-05-07T20:32:12.3828807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.3829504Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.3830061Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3830302Z 2025-05-07T20:32:12.3830515Z self = 2025-05-07T20:32:12.3831686Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.3833193Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef7fea60>} 2025-05-07T20:32:12.3834750Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.3835860Z context = 2025-05-07T20:32:12.3836166Z 2025-05-07T20:32:12.3836334Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.3836879Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.3837372Z module_map=module_map) 2025-05-07T20:32:12.3837748Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.3838109Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.3838377Z E ^ 2025-05-07T20:32:12.3838865Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.3839358Z 2025-05-07T20:32:12.3839847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.3840426Z 2025-05-07T20:32:12.8707097Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.8708376Z self=, 2025-05-07T20:32:12.8709374Z T=16384, 2025-05-07T20:32:12.8709606Z D=5120, 2025-05-07T20:32:12.8709916Z scale_ub=1200.0, 2025-05-07T20:32:12.8710137Z contiguous=True, 2025-05-07T20:32:12.8710369Z compiled=False, 2025-05-07T20:32:12.8710586Z ) 2025-05-07T20:32:12.8711010Z self = 2025-05-07T20:32:12.8711740Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.8712062Z 2025-05-07T20:32:12.8712186Z @given( 2025-05-07T20:32:12.8712508Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.8712983Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.8713413Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.8714016Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.8714357Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.8714653Z ) 2025-05-07T20:32:12.8715015Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.8715479Z def test_silu_mul_quant( 2025-05-07T20:32:12.8715725Z self, 2025-05-07T20:32:12.8715917Z T: int, 2025-05-07T20:32:12.8716109Z D: int, 2025-05-07T20:32:12.8716331Z scale_ub: Optional[float], 2025-05-07T20:32:12.8716641Z contiguous: bool, 2025-05-07T20:32:12.8716885Z compiled: bool, 2025-05-07T20:32:12.8717112Z ) -> None: 2025-05-07T20:32:12.8717320Z torch.manual_seed(2025) 2025-05-07T20:32:12.8717567Z 2025-05-07T20:32:12.8717843Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.8718194Z 2025-05-07T20:32:12.8718389Z x_sign = torch.sign(x) 2025-05-07T20:32:12.8718692Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.8719004Z x = x_sign * x_clamp 2025-05-07T20:32:12.8719250Z x0 = x[:, :D] 2025-05-07T20:32:12.8719469Z x1 = x[:, D:] 2025-05-07T20:32:12.8719674Z 2025-05-07T20:32:12.8719859Z if contiguous: 2025-05-07T20:32:12.8720092Z x0 = x0.contiguous() 2025-05-07T20:32:12.8720355Z x1 = x1.contiguous() 2025-05-07T20:32:12.8720601Z 2025-05-07T20:32:12.8720794Z if scale_ub is not None: 2025-05-07T20:32:12.8721073Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.8721414Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.8721732Z ) 2025-05-07T20:32:12.8721931Z else: 2025-05-07T20:32:12.8722143Z scale_ub_tensor = None 2025-05-07T20:32:12.8722533Z 2025-05-07T20:32:12.8722765Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.8723083Z op = silu_mul_quant 2025-05-07T20:32:12.8723337Z if compiled: 2025-05-07T20:32:12.8723596Z op = torch.compile(op) 2025-05-07T20:32:12.8723894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.8724179Z 2025-05-07T20:32:12.8724370Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.8724537Z 2025-05-07T20:32:12.8724644Z moe/activation_test.py:117: 2025-05-07T20:32:12.8724941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.8725292Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.8725585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.8726319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.8727064Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.8727640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.8728378Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.8729086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.8729653Z kernel = self.compile( 2025-05-07T20:32:12.8730224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.8730914Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.8731326Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.8731572Z 2025-05-07T20:32:12.8731784Z self = 2025-05-07T20:32:12.8732953Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.8734563Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef6ec550>} 2025-05-07T20:32:12.8736026Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.8737131Z context = 2025-05-07T20:32:12.8737433Z 2025-05-07T20:32:12.8737607Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.8738155Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.8738641Z module_map=module_map) 2025-05-07T20:32:12.8739022Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.8739386Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.8739645Z E ^ 2025-05-07T20:32:12.8740140Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.8740630Z 2025-05-07T20:32:12.8741082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.8741633Z 2025-05-07T20:32:12.8741739Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.8742158Z self=, 2025-05-07T20:32:12.8742581Z T=1, 2025-05-07T20:32:12.8742763Z D=7168, 2025-05-07T20:32:12.8742948Z scale_ub=1200.0, 2025-05-07T20:32:12.8743178Z contiguous=False, 2025-05-07T20:32:12.8743409Z compiled=False, 2025-05-07T20:32:12.8743694Z ) 2025-05-07T20:32:12.8744022Z self = 2025-05-07T20:32:12.8744535Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.8744819Z 2025-05-07T20:32:12.8744898Z @given( 2025-05-07T20:32:12.8745130Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.8745456Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.8745771Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.8746105Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.8746442Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.8746735Z ) 2025-05-07T20:32:12.8747091Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.8747557Z def test_silu_mul_quant( 2025-05-07T20:32:12.8747803Z self, 2025-05-07T20:32:12.8747988Z T: int, 2025-05-07T20:32:12.8748186Z D: int, 2025-05-07T20:32:12.8748415Z scale_ub: Optional[float], 2025-05-07T20:32:12.8748682Z contiguous: bool, 2025-05-07T20:32:12.8748926Z compiled: bool, 2025-05-07T20:32:12.8749149Z ) -> None: 2025-05-07T20:32:12.8749367Z torch.manual_seed(2025) 2025-05-07T20:32:12.8749611Z 2025-05-07T20:32:12.8749977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.8750336Z 2025-05-07T20:32:12.8750524Z x_sign = torch.sign(x) 2025-05-07T20:32:12.8750818Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.8751135Z x = x_sign * x_clamp 2025-05-07T20:32:12.8751374Z x0 = x[:, :D] 2025-05-07T20:32:12.8751589Z x1 = x[:, D:] 2025-05-07T20:32:12.8751796Z 2025-05-07T20:32:12.8751968Z if contiguous: 2025-05-07T20:32:12.8752201Z x0 = x0.contiguous() 2025-05-07T20:32:12.8752464Z x1 = x1.contiguous() 2025-05-07T20:32:12.8752704Z 2025-05-07T20:32:12.8752896Z if scale_ub is not None: 2025-05-07T20:32:12.8753176Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.8753511Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.8753829Z ) 2025-05-07T20:32:12.8754106Z else: 2025-05-07T20:32:12.8754312Z scale_ub_tensor = None 2025-05-07T20:32:12.8754572Z 2025-05-07T20:32:12.8754800Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.8755125Z op = silu_mul_quant 2025-05-07T20:32:12.8755371Z if compiled: 2025-05-07T20:32:12.8755617Z op = torch.compile(op) 2025-05-07T20:32:12.8755920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.8756197Z 2025-05-07T20:32:12.8756388Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.8756554Z 2025-05-07T20:32:12.8756654Z moe/activation_test.py:117: 2025-05-07T20:32:12.8756951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.8757298Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.8757587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.8758324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.8759072Z 
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same test body and CompilationError traceback as above; compiled=True examples additionally pass through /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn ...]

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same CompilationError ...]

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
[... same CompilationError ...]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (32.44 MiB free) ... same advice as above ...
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (144.44 MiB free) ...
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (32.44 MiB free) ...
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (32.44 MiB free) ...
moe/activation_test.py:94: OutOfMemoryError
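The OutOfMemoryError run above is a second, independent failure mode: each Hypothesis example materializes x of shape [T, 2 * D] in bfloat16 (T=16384 with D=7168 is 16384 * 14336 * 2 bytes = 448 MiB, matching the failed allocation), and across many examples the 22.07 GiB device fills until even a 56 MiB request fails. A hedged sketch of per-example cleanup that could be called at the end of the test body; the helper name is illustrative, not from the test file:

    import gc
    import torch

    def free_cuda_between_examples() -> None:
        # Drop dead Python references first so their CUDA blocks can be freed,
        # then hand the allocator's cached blocks back to the driver.
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

The error text's own suggestion, setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before the process starts, targets fragmentation rather than total footprint, so both measures may be needed.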
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[... same CompilationError traceback as above ...]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
[... same CompilationError ...]

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... same CompilationError ...]

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (30.44 MiB free) ...
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... same CompilationError ...]

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (30.44 MiB free) ...
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (30.44 MiB free) ...
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (30.44 MiB free) ...
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.7027663Z 2025-05-07T20:32:13.7027785Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.7028005Z 2025-05-07T20:32:13.7028108Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.7028533Z self=, 2025-05-07T20:32:13.7028956Z T=2048, 2025-05-07T20:32:13.7029135Z D=5120, 2025-05-07T20:32:13.7029324Z scale_ub=None, 2025-05-07T20:32:13.7029538Z contiguous=False, 2025-05-07T20:32:13.7029893Z compiled=False, 2025-05-07T20:32:13.7030101Z ) 2025-05-07T20:32:13.7030421Z self = 2025-05-07T20:32:13.7030932Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:13.7031231Z 2025-05-07T20:32:13.7031305Z @given( 2025-05-07T20:32:13.7031532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.7031846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.7032154Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.7032496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.7032828Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.7033110Z ) 2025-05-07T20:32:13.7033465Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.7033927Z def test_silu_mul_quant( 2025-05-07T20:32:13.7034163Z self, 2025-05-07T20:32:13.7034352Z T: int, 2025-05-07T20:32:13.7034547Z D: int, 2025-05-07T20:32:13.7034769Z scale_ub: Optional[float], 2025-05-07T20:32:13.7035036Z contiguous: bool, 2025-05-07T20:32:13.7035279Z compiled: bool, 2025-05-07T20:32:13.7035495Z ) -> None: 2025-05-07T20:32:13.7035835Z torch.manual_seed(2025) 2025-05-07T20:32:13.7036078Z 2025-05-07T20:32:13.7036349Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.7038586Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.7040640Z 2025-05-07T20:32:13.7040756Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.7040978Z 2025-05-07T20:32:13.7041078Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.7041498Z self=, 2025-05-07T20:32:13.7041917Z T=4096, 2025-05-07T20:32:13.7042091Z D=7168, 2025-05-07T20:32:13.7042276Z scale_ub=None, 2025-05-07T20:32:13.7042487Z contiguous=True, 2025-05-07T20:32:13.7042695Z compiled=True, 2025-05-07T20:32:13.7042891Z ) 2025-05-07T20:32:13.7043211Z self = 2025-05-07T20:32:13.7043715Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:13.7044002Z 2025-05-07T20:32:13.7044075Z @given( 2025-05-07T20:32:13.7044300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.7044798Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.7045135Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.7045504Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.7045874Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.7046194Z ) 2025-05-07T20:32:13.7046590Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.7047110Z def test_silu_mul_quant( 2025-05-07T20:32:13.7047365Z self, 2025-05-07T20:32:13.7047649Z T: int, 2025-05-07T20:32:13.7047854Z D: int, 2025-05-07T20:32:13.7048079Z scale_ub: Optional[float], 2025-05-07T20:32:13.7048372Z contiguous: bool, 2025-05-07T20:32:13.7048629Z compiled: bool, 2025-05-07T20:32:13.7048856Z ) -> None: 2025-05-07T20:32:13.7049076Z torch.manual_seed(2025) 2025-05-07T20:32:13.7049334Z 2025-05-07T20:32:13.7049619Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.7052289Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.7054712Z 2025-05-07T20:32:13.7054841Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.7055082Z 2025-05-07T20:32:13.7055189Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.7055662Z self=, 2025-05-07T20:32:13.7056124Z T=2048, 2025-05-07T20:32:13.7056311Z D=5120, 2025-05-07T20:32:13.7056508Z scale_ub=1200.0, 2025-05-07T20:32:13.7056744Z contiguous=False, 2025-05-07T20:32:13.7056976Z compiled=False, 2025-05-07T20:32:13.7057190Z ) 2025-05-07T20:32:13.7057541Z self = 2025-05-07T20:32:13.7058115Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:13.7058534Z 2025-05-07T20:32:13.7058613Z @given( 2025-05-07T20:32:13.7058850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.7059200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.7059532Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.7059910Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.7060284Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.7060596Z ) 2025-05-07T20:32:13.7060992Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.7061517Z def test_silu_mul_quant( 2025-05-07T20:32:13.7061768Z self, 2025-05-07T20:32:13.7061970Z T: int, 2025-05-07T20:32:13.7062176Z D: int, 2025-05-07T20:32:13.7062398Z scale_ub: Optional[float], 2025-05-07T20:32:13.7062694Z contiguous: bool, 2025-05-07T20:32:13.7062958Z compiled: bool, 2025-05-07T20:32:13.7063201Z ) -> None: 2025-05-07T20:32:13.7063421Z torch.manual_seed(2025) 2025-05-07T20:32:13.7063683Z 2025-05-07T20:32:13.7063980Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.7066608Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.7069023Z 2025-05-07T20:32:13.7069148Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.7069395Z 2025-05-07T20:32:13.7069504Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.7070037Z self=, 2025-05-07T20:32:13.7070496Z T=4096, 2025-05-07T20:32:13.7070685Z D=7168, 2025-05-07T20:32:13.7070965Z scale_ub=1200.0, 2025-05-07T20:32:13.7071202Z contiguous=True, 2025-05-07T20:32:13.7071431Z compiled=False, 2025-05-07T20:32:13.7071648Z ) 2025-05-07T20:32:13.7072001Z self = 2025-05-07T20:32:13.7072570Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:13.7072896Z 2025-05-07T20:32:13.7072973Z @given( 2025-05-07T20:32:13.7073212Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.7073556Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.7073895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.7074261Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.7074631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.7074943Z ) 2025-05-07T20:32:13.7075335Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.7075857Z def test_silu_mul_quant( 2025-05-07T20:32:13.7076112Z self, 2025-05-07T20:32:13.7076310Z T: int, 2025-05-07T20:32:13.7076520Z D: int, 2025-05-07T20:32:13.7076742Z scale_ub: Optional[float], 2025-05-07T20:32:13.7077037Z contiguous: bool, 2025-05-07T20:32:13.7077291Z compiled: bool, 2025-05-07T20:32:13.7077522Z ) -> None: 2025-05-07T20:32:13.7077746Z torch.manual_seed(2025) 2025-05-07T20:32:13.7078004Z 2025-05-07T20:32:13.7078290Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.7080931Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.7083328Z 2025-05-07T20:32:13.7083450Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.7083678Z 2025-05-07T20:32:13.7083777Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.7084203Z self=, 2025-05-07T20:32:13.7084718Z T=16384, 2025-05-07T20:32:13.7084919Z D=7168, 2025-05-07T20:32:13.7085101Z scale_ub=None, 2025-05-07T20:32:13.7085306Z contiguous=False, 2025-05-07T20:32:13.7085528Z compiled=True, 2025-05-07T20:32:13.7085726Z ) 2025-05-07T20:32:13.8371260Z self = 2025-05-07T20:32:13.8372060Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:13.8372468Z 2025-05-07T20:32:13.8372562Z @given( 2025-05-07T20:32:13.8372793Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8373115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8373426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8373753Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8374088Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8374381Z ) 2025-05-07T20:32:13.8374736Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8375202Z def test_silu_mul_quant( 2025-05-07T20:32:13.8375446Z self, 2025-05-07T20:32:13.8375634Z T: int, 2025-05-07T20:32:13.8375836Z D: int, 2025-05-07T20:32:13.8376055Z scale_ub: Optional[float], 2025-05-07T20:32:13.8376337Z contiguous: bool, 2025-05-07T20:32:13.8376574Z compiled: bool, 2025-05-07T20:32:13.8376799Z ) -> None: 2025-05-07T20:32:13.8377007Z torch.manual_seed(2025) 2025-05-07T20:32:13.8377435Z 2025-05-07T20:32:13.8377712Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8379968Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
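The requested sizes line up with the input tensor the test allocates: x = torch.randn([T, 2 * D], dtype=torch.bfloat16) takes T * 2D * 2 bytes, and the larger failures match that arithmetic exactly. A standalone check, using values taken from the log above:

T, D = 16384, 7168
bytes_needed = T * (2 * D) * 2      # bfloat16 is 2 bytes per element
print(bytes_needed / 2**20)         # 448.0, matching "Tried to allocate 448.00 MiB"
# The same arithmetic gives 112 MiB for T=4096, D=7168 and 40 MiB for
# T=2048, D=5120. The 20 MiB requests for T=128 are larger than the tensor
# itself; allocator block rounding is a plausible but unverified explanation.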
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.8382032Z 2025-05-07T20:32:13.8382154Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.8382374Z 2025-05-07T20:32:13.8382474Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.8383203Z self=, 2025-05-07T20:32:13.8383629Z T=4096, 2025-05-07T20:32:13.8383820Z D=7168, 2025-05-07T20:32:13.8384002Z scale_ub=None, 2025-05-07T20:32:13.8384212Z contiguous=True, 2025-05-07T20:32:13.8384427Z compiled=False, 2025-05-07T20:32:13.8384624Z ) 2025-05-07T20:32:13.8384941Z self = 2025-05-07T20:32:13.8385455Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:13.8385739Z 2025-05-07T20:32:13.8385816Z @given( 2025-05-07T20:32:13.8386040Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8386363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8386665Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8387125Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8387459Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8387750Z ) 2025-05-07T20:32:13.8388106Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8388567Z def test_silu_mul_quant( 2025-05-07T20:32:13.8388813Z self, 2025-05-07T20:32:13.8389000Z T: int, 2025-05-07T20:32:13.8389195Z D: int, 2025-05-07T20:32:13.8389411Z scale_ub: Optional[float], 2025-05-07T20:32:13.8389815Z contiguous: bool, 2025-05-07T20:32:13.8390057Z compiled: bool, 2025-05-07T20:32:13.8390277Z ) -> None: 2025-05-07T20:32:13.8390486Z torch.manual_seed(2025) 2025-05-07T20:32:13.8390730Z 2025-05-07T20:32:13.8391000Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8393248Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.8395308Z 2025-05-07T20:32:13.8395427Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.8395645Z 2025-05-07T20:32:13.8395749Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.8396169Z self=, 2025-05-07T20:32:13.8396582Z T=16384, 2025-05-07T20:32:13.8396766Z D=7168, 2025-05-07T20:32:13.8396954Z scale_ub=None, 2025-05-07T20:32:13.8397164Z contiguous=True, 2025-05-07T20:32:13.8397380Z compiled=False, 2025-05-07T20:32:13.8397587Z ) 2025-05-07T20:32:13.8397909Z self = 2025-05-07T20:32:13.8398417Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:13.8398835Z 2025-05-07T20:32:13.8398913Z @given( 2025-05-07T20:32:13.8399136Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8399459Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8399770Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8400112Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8400451Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8400741Z ) 2025-05-07T20:32:13.8401108Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8401575Z def test_silu_mul_quant( 2025-05-07T20:32:13.8401820Z self, 2025-05-07T20:32:13.8402020Z T: int, 2025-05-07T20:32:13.8402221Z D: int, 2025-05-07T20:32:13.8402444Z scale_ub: Optional[float], 2025-05-07T20:32:13.8402720Z contiguous: bool, 2025-05-07T20:32:13.8402967Z compiled: bool, 2025-05-07T20:32:13.8403191Z ) -> None: 2025-05-07T20:32:13.8403412Z torch.manual_seed(2025) 2025-05-07T20:32:13.8403654Z 2025-05-07T20:32:13.8403928Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8406165Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.8408309Z 2025-05-07T20:32:13.8408425Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.8408646Z 2025-05-07T20:32:13.8408748Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.8409173Z self=, 2025-05-07T20:32:13.8409595Z T=16384, 2025-05-07T20:32:13.8409786Z D=7168, 2025-05-07T20:32:13.8409972Z scale_ub=1200.0, 2025-05-07T20:32:13.8410192Z contiguous=True, 2025-05-07T20:32:13.8410406Z compiled=False, 2025-05-07T20:32:13.8410608Z ) 2025-05-07T20:32:13.8410930Z self = 2025-05-07T20:32:13.8411442Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:13.8411738Z 2025-05-07T20:32:13.8411812Z @given( 2025-05-07T20:32:13.8412041Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8412349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8412669Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8413000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8413337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8413627Z ) 2025-05-07T20:32:13.8413983Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8414443Z def test_silu_mul_quant( 2025-05-07T20:32:13.8414684Z self, 2025-05-07T20:32:13.8414875Z T: int, 2025-05-07T20:32:13.8415066Z D: int, 2025-05-07T20:32:13.8415275Z scale_ub: Optional[float], 2025-05-07T20:32:13.8415549Z contiguous: bool, 2025-05-07T20:32:13.8415785Z compiled: bool, 2025-05-07T20:32:13.8416001Z ) -> None: 2025-05-07T20:32:13.8416212Z torch.manual_seed(2025) 2025-05-07T20:32:13.8416458Z 2025-05-07T20:32:13.8416720Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8419087Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.8421152Z 2025-05-07T20:32:13.8421268Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.8421491Z 2025-05-07T20:32:13.8421591Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.8422012Z self=, 2025-05-07T20:32:13.8422423Z T=128, 2025-05-07T20:32:13.8422606Z D=5120, 2025-05-07T20:32:13.8422793Z scale_ub=1200.0, 2025-05-07T20:32:13.8423013Z contiguous=False, 2025-05-07T20:32:13.8423235Z compiled=False, 2025-05-07T20:32:13.8423437Z ) 2025-05-07T20:32:14.2249861Z self = 2025-05-07T20:32:14.2251440Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.2252232Z 2025-05-07T20:32:14.2252441Z @given( 2025-05-07T20:32:14.2253029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2253665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2254276Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2254945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2255607Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2256181Z ) 2025-05-07T20:32:14.2256895Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2257819Z def test_silu_mul_quant( 2025-05-07T20:32:14.2258289Z self, 2025-05-07T20:32:14.2259011Z T: int, 2025-05-07T20:32:14.2259397Z D: int, 2025-05-07T20:32:14.2259660Z scale_ub: Optional[float], 2025-05-07T20:32:14.2259932Z contiguous: bool, 2025-05-07T20:32:14.2260181Z compiled: bool, 2025-05-07T20:32:14.2260407Z ) -> None: 2025-05-07T20:32:14.2260614Z torch.manual_seed(2025) 2025-05-07T20:32:14.2260858Z 2025-05-07T20:32:14.2261131Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2261481Z 2025-05-07T20:32:14.2261674Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2261970Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2262280Z x = x_sign * x_clamp 2025-05-07T20:32:14.2262524Z x0 = x[:, :D] 2025-05-07T20:32:14.2262740Z x1 = x[:, D:] 2025-05-07T20:32:14.2262947Z 2025-05-07T20:32:14.2263126Z if contiguous: 2025-05-07T20:32:14.2263359Z x0 = x0.contiguous() 2025-05-07T20:32:14.2263620Z x1 = x1.contiguous() 2025-05-07T20:32:14.2263861Z 2025-05-07T20:32:14.2264051Z if scale_ub is not None: 2025-05-07T20:32:14.2264321Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2264667Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2264983Z ) 2025-05-07T20:32:14.2265173Z else: 2025-05-07T20:32:14.2265377Z scale_ub_tensor = None 2025-05-07T20:32:14.2265626Z 2025-05-07T20:32:14.2265860Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2266175Z op = silu_mul_quant 2025-05-07T20:32:14.2266426Z if compiled: 2025-05-07T20:32:14.2266669Z op = torch.compile(op) 2025-05-07T20:32:14.2266965Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2267246Z 2025-05-07T20:32:14.2267437Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2267602Z 2025-05-07T20:32:14.2267699Z moe/activation_test.py:117: 2025-05-07T20:32:14.2268009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2268355Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2268639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2269493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2270430Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2270996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2271724Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2272420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2272984Z kernel = self.compile( 2025-05-07T20:32:14.2273556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2274254Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2274670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2282115Z 2025-05-07T20:32:14.2282356Z self = 2025-05-07T20:32:14.2283730Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2285241Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ceefeb670>} 2025-05-07T20:32:14.2286718Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2287992Z context = 2025-05-07T20:32:14.2288299Z 2025-05-07T20:32:14.2288481Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2289029Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2289515Z module_map=module_map) 2025-05-07T20:32:14.2289889Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2290255Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2290520Z E ^ 2025-05-07T20:32:14.2291009Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2291495Z 2025-05-07T20:32:14.2291945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2292506Z 2025-05-07T20:32:14.2292614Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2293029Z self=, 2025-05-07T20:32:14.2293448Z T=2048, 2025-05-07T20:32:14.2293638Z D=7168, 2025-05-07T20:32:14.2293820Z scale_ub=None, 2025-05-07T20:32:14.2294035Z contiguous=False, 2025-05-07T20:32:14.2294257Z compiled=False, 2025-05-07T20:32:14.2294459Z ) 2025-05-07T20:32:14.2294783Z self = 2025-05-07T20:32:14.2295297Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:14.2295585Z 2025-05-07T20:32:14.2295665Z @given( 2025-05-07T20:32:14.2295886Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2296201Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2296512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2296841Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2297185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2297474Z ) 2025-05-07T20:32:14.2297944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2298409Z def test_silu_mul_quant( 2025-05-07T20:32:14.2298646Z self, 2025-05-07T20:32:14.2298835Z T: int, 2025-05-07T20:32:14.2299027Z D: int, 2025-05-07T20:32:14.2299237Z scale_ub: Optional[float], 2025-05-07T20:32:14.2299507Z contiguous: bool, 2025-05-07T20:32:14.2299742Z compiled: bool, 2025-05-07T20:32:14.2299955Z ) -> None: 2025-05-07T20:32:14.2300169Z torch.manual_seed(2025) 2025-05-07T20:32:14.2300410Z 2025-05-07T20:32:14.2300688Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2302931Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
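The allocator hint repeated in every OOM message has to be in place before CUDA initializes; below is a minimal sketch of applying it, where the placement in conftest.py is an assumption and the variable name and value come straight from the log:

# Top of conftest.py (or exported in the shell that launches pytest):
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
import torch  # imported after the env var so the allocator picks it up

Equivalently from the shell: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py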
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2304981Z 2025-05-07T20:32:14.2305096Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2305318Z 2025-05-07T20:32:14.2305418Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2305840Z self=, 2025-05-07T20:32:14.2306252Z T=128, 2025-05-07T20:32:14.2306429Z D=7168, 2025-05-07T20:32:14.2306616Z scale_ub=1200.0, 2025-05-07T20:32:14.2306827Z contiguous=True, 2025-05-07T20:32:14.2307045Z compiled=True, 2025-05-07T20:32:14.2307243Z ) 2025-05-07T20:32:14.2749439Z self = 2025-05-07T20:32:14.2750356Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.2750759Z 2025-05-07T20:32:14.2750882Z @given( 2025-05-07T20:32:14.2751195Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2751611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2751935Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2752289Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2752627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2752919Z ) 2025-05-07T20:32:14.2753284Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2753749Z def test_silu_mul_quant( 2025-05-07T20:32:14.2753996Z self, 2025-05-07T20:32:14.2754192Z T: int, 2025-05-07T20:32:14.2754393Z D: int, 2025-05-07T20:32:14.2754611Z scale_ub: Optional[float], 2025-05-07T20:32:14.2754899Z contiguous: bool, 2025-05-07T20:32:14.2755146Z compiled: bool, 2025-05-07T20:32:14.2755381Z ) -> None: 2025-05-07T20:32:14.2755603Z torch.manual_seed(2025) 2025-05-07T20:32:14.2755860Z 2025-05-07T20:32:14.2756134Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2756489Z 2025-05-07T20:32:14.2756685Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2756982Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2757298Z x = x_sign * x_clamp 2025-05-07T20:32:14.2757544Z x0 = x[:, :D] 2025-05-07T20:32:14.2757763Z x1 = x[:, D:] 2025-05-07T20:32:14.2757972Z 2025-05-07T20:32:14.2758160Z if contiguous: 2025-05-07T20:32:14.2758397Z x0 = x0.contiguous() 2025-05-07T20:32:14.2758659Z x1 = x1.contiguous() 2025-05-07T20:32:14.2758909Z 2025-05-07T20:32:14.2759102Z if scale_ub is not None: 2025-05-07T20:32:14.2759389Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2759733Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2760062Z ) 2025-05-07T20:32:14.2760473Z else: 2025-05-07T20:32:14.2760686Z scale_ub_tensor = None 2025-05-07T20:32:14.2760939Z 2025-05-07T20:32:14.2761169Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2761491Z op = silu_mul_quant 2025-05-07T20:32:14.2761744Z if compiled: 2025-05-07T20:32:14.2761995Z op = torch.compile(op) 2025-05-07T20:32:14.2762292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2762582Z 2025-05-07T20:32:14.2762773Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2762938Z 2025-05-07T20:32:14.2763035Z moe/activation_test.py:117: 2025-05-07T20:32:14.2763333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2763679Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2763960Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2764552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2765152Z return fn(*args, **kwargs) 2025-05-07T20:32:14.2765862Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2766600Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2767166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2767891Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2768597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2769162Z kernel = self.compile( 2025-05-07T20:32:14.2769735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2770559Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2770973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2771220Z 2025-05-07T20:32:14.2771432Z self = 2025-05-07T20:32:14.2772605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2774110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ceefda5e0>} 2025-05-07T20:32:14.2775578Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2776687Z context = 2025-05-07T20:32:14.2776999Z 2025-05-07T20:32:14.2777166Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2777717Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2778203Z module_map=module_map) 2025-05-07T20:32:14.2778573Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2778934Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2779200Z E ^ 2025-05-07T20:32:14.2779686Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2780177Z 2025-05-07T20:32:14.2780624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2781189Z 2025-05-07T20:32:14.2781289Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2781799Z self=, 2025-05-07T20:32:14.2782217Z T=128, 2025-05-07T20:32:14.2782411Z D=7168, 2025-05-07T20:32:14.2782602Z scale_ub=1200.0, 2025-05-07T20:32:14.2782999Z contiguous=True, 2025-05-07T20:32:14.2783224Z compiled=False, 2025-05-07T20:32:14.2783431Z ) 2025-05-07T20:32:14.2783750Z self = 2025-05-07T20:32:14.2784268Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:14.2784553Z 2025-05-07T20:32:14.2784633Z @given( 2025-05-07T20:32:14.2784865Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2785179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2785493Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2785834Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2786166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2786473Z ) 2025-05-07T20:32:14.2786838Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2787300Z def test_silu_mul_quant( 2025-05-07T20:32:14.2787538Z self, 2025-05-07T20:32:14.2787731Z T: int, 2025-05-07T20:32:14.2787928Z D: int, 2025-05-07T20:32:14.2788140Z scale_ub: Optional[float], 2025-05-07T20:32:14.2788414Z contiguous: bool, 2025-05-07T20:32:14.2788655Z compiled: bool, 2025-05-07T20:32:14.2788869Z ) -> None: 2025-05-07T20:32:14.2789084Z torch.manual_seed(2025) 2025-05-07T20:32:14.2789335Z 2025-05-07T20:32:14.2789637Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2790100Z 2025-05-07T20:32:14.2790296Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2790712Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2792919Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
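The CompilationError here is an architecture limit rather than a code bug: Triton's fp8e4nv corresponds to float8_e4m3fn and compiles only on compute capability 8.9 or newer (Ada/Hopper), while the A10G behind this linux.g5.4xlarge runner is SM 8.6. A hedged sketch of a capability guard follows; the helper name and skip wiring are illustrative, not from activation_test.py:

import unittest
import torch

def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) needs SM 8.9+; the A10G on this runner is SM 8.6.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class Fp8ActivationTests(unittest.TestCase):
    ...  # fp8 quantization cases would go here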
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2794972Z 2025-05-07T20:32:14.2795090Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:14.2795312Z 2025-05-07T20:32:14.2795413Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2795834Z self=, 2025-05-07T20:32:14.2796249Z T=128, 2025-05-07T20:32:14.2796436Z D=5120, 2025-05-07T20:32:14.2796629Z scale_ub=1200.0, 2025-05-07T20:32:14.2796844Z contiguous=True, 2025-05-07T20:32:14.2797063Z compiled=True, 2025-05-07T20:32:14.2797263Z ) 2025-05-07T20:32:14.2797585Z self = 2025-05-07T20:32:14.2798097Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.2798384Z 2025-05-07T20:32:14.2798462Z @given( 2025-05-07T20:32:14.2798693Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2799010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2799335Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2799678Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2800011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2800311Z ) 2025-05-07T20:32:14.2800675Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2801143Z def test_silu_mul_quant( 2025-05-07T20:32:14.2801400Z self, 2025-05-07T20:32:14.2801595Z T: int, 2025-05-07T20:32:14.2801792Z D: int, 2025-05-07T20:32:14.2802158Z scale_ub: Optional[float], 2025-05-07T20:32:14.2802435Z contiguous: bool, 2025-05-07T20:32:14.2802669Z compiled: bool, 2025-05-07T20:32:14.2802890Z ) -> None: 2025-05-07T20:32:14.2803109Z torch.manual_seed(2025) 2025-05-07T20:32:14.2803345Z 2025-05-07T20:32:14.2803620Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2803973Z 2025-05-07T20:32:14.2804162Z > x_sign = torch.sign(x) 2025-05-07T20:32:14.2806285Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2808760Z 2025-05-07T20:32:14.2808884Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:14.2809108Z 2025-05-07T20:32:14.2809209Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2809633Z self=, 2025-05-07T20:32:14.2810046Z T=128, 2025-05-07T20:32:14.2810224Z D=7168, 2025-05-07T20:32:14.2810412Z scale_ub=None, 2025-05-07T20:32:14.2810614Z contiguous=True, 2025-05-07T20:32:14.2810832Z compiled=True, 2025-05-07T20:32:14.2811032Z ) 2025-05-07T20:32:14.5856123Z self = 2025-05-07T20:32:14.5857563Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.5858471Z 2025-05-07T20:32:14.5858630Z @given( 2025-05-07T20:32:14.5859089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.5859694Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.5860050Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.5860391Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.5860724Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.5861017Z ) 2025-05-07T20:32:14.5861379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.5861841Z def test_silu_mul_quant( 2025-05-07T20:32:14.5862079Z self, 2025-05-07T20:32:14.5862272Z T: int, 2025-05-07T20:32:14.5862474Z D: int, 2025-05-07T20:32:14.5862689Z scale_ub: Optional[float], 2025-05-07T20:32:14.5862965Z contiguous: bool, 2025-05-07T20:32:14.5863206Z compiled: bool, 2025-05-07T20:32:14.5863433Z ) -> None: 2025-05-07T20:32:14.5863653Z torch.manual_seed(2025) 2025-05-07T20:32:14.5863898Z 2025-05-07T20:32:14.5864169Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.5866407Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
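Note that 21.77 GiB of the 22.07 GiB here is allocated, not merely cached, so torch.cuda.empty_cache() alone cannot recover it; releasing stale references between Hypothesis examples is one mitigation worth trying, sketched below under that assumption rather than as a verified fix:

import gc
import unittest
import torch

class ActivationTestsWithCleanup(unittest.TestCase):  # illustrative name
    def tearDown(self) -> None:
        gc.collect()              # drop tensors that are no longer referenced
        torch.cuda.empty_cache()  # then return cached blocks to the driver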
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.5868449Z 2025-05-07T20:32:14.5868567Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.5868788Z 2025-05-07T20:32:14.5926143Z FAILED 2025-05-07T20:32:14.5926343Z 2025-05-07T20:32:14.5926526Z =================================== FAILURES =================================== 2025-05-07T20:32:14.5927158Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:14.5927968Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:14.5928861Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:14.5929656Z | yield 2025-05-07T20:32:14.5930263Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:32:14.5930997Z | self._callTestMethod(testMethod) 2025-05-07T20:32:14.5931793Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:32:14.5932562Z | method() 2025-05-07T20:32:14.5933466Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:14.5934514Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.5935422Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:14.5936327Z | raise the_error_hypothesis_found 2025-05-07T20:32:14.5937012Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:14.5937718Z +-+---------------- 1 ---------------- 2025-05-07T20:32:14.5938118Z | Traceback (most recent call last): 2025-05-07T20:32:14.5939138Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:14.5940243Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.5943300Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.5946317Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:14.5946782Z | self=, 2025-05-07T20:32:14.5947215Z | T=128, 2025-05-07T20:32:14.5947417Z | D=7168, 2025-05-07T20:32:14.5947634Z | scale_ub=1200.0, 2025-05-07T20:32:14.5947885Z | contiguous=True, 2025-05-07T20:32:14.5948126Z | compiled=False, 2025-05-07T20:32:14.5948358Z | ) 2025-05-07T20:32:14.5948542Z | 2025-05-07T20:32:14.5949101Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:32:14.5949903Z +---------------- 2 ---------------- 2025-05-07T20:32:14.5950230Z | Traceback (most recent call last): 2025-05-07T20:32:14.5950994Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:14.5951826Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.5954059Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.5956316Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:14.5956774Z | self=, 2025-05-07T20:32:14.5957199Z | T=128, 2025-05-07T20:32:14.5957392Z | D=7168, 2025-05-07T20:32:14.5957599Z | scale_ub=None, 2025-05-07T20:32:14.5957837Z | contiguous=True, 2025-05-07T20:32:14.5958074Z | compiled=True, 2025-05-07T20:32:14.5958293Z | ) 2025-05-07T20:32:14.5958469Z | 2025-05-07T20:32:14.5959012Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:14.5959661Z +---------------- 3 ---------------- 2025-05-07T20:32:14.5959953Z | Traceback (most recent call last): 2025-05-07T20:32:14.5960715Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:14.5961546Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.5964252Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
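The reproduction recipe Hypothesis prints is literal: stack the quoted decorator, version string and payload unchanged, on top of the failing test. A sketch using the blob from falsifying example 1 above, with the @given block copied from the log (self dropped so the sketch stands alone):

from hypothesis import given, reproduce_failure, strategies as st

@reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
    ...  # body as in moe/activation_test.py

Remove the decorator again once the failure is fixed; it pins the test to this single example.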
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.5966986Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:14.5967610Z | self=, 2025-05-07T20:32:14.5968313Z | T=128, 2025-05-07T20:32:14.5968591Z | D=5120, 2025-05-07T20:32:14.5968880Z | scale_ub=1200.0, 2025-05-07T20:32:14.5969220Z | contiguous=True, 2025-05-07T20:32:14.5969553Z | compiled=True, 2025-05-07T20:32:14.5969855Z | ) 2025-05-07T20:32:14.5970107Z | 2025-05-07T20:32:14.5970853Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:14.5971748Z +---------------- 4 ---------------- 2025-05-07T20:32:14.5972154Z | Traceback (most recent call last): 2025-05-07T20:32:14.5973165Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:14.5974183Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.5975128Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:14.5976135Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.5977334Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:14.5978496Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.5979364Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:14.5980409Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.5981464Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:14.5982570Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.5983970Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:14.5985336Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.5986193Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:14.5986941Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6003928Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:14.6004827Z | fn() 2025-05-07T20:32:14.6005697Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:14.6006632Z | self.fn.run( 2025-05-07T20:32:14.6007419Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:14.6008277Z | kernel = self.compile( 2025-05-07T20:32:14.6009156Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:14.6010142Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6011196Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:14.6012411Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6013172Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6013691Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6014080Z | ^ 2025-05-07T20:32:14.6014774Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6015888Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:14.6016482Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:14.6017235Z | self=, 2025-05-07T20:32:14.6017863Z | T=1, # or any other generated value 2025-05-07T20:32:14.6018316Z | D=5120, # or any other generated value 2025-05-07T20:32:14.6018806Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:14.6019336Z | contiguous=True, # or any other generated value 2025-05-07T20:32:14.6019854Z | compiled=True, # or any other generated value 2025-05-07T20:32:14.6020303Z | ) 2025-05-07T20:32:14.6020569Z | 2025-05-07T20:32:14.6021349Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:14.6022288Z +------------------------------------ 2025-05-07T20:32:14.6022810Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:14.6023354Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6023957Z self=, 2025-05-07T20:32:14.6024540Z T=1, 2025-05-07T20:32:14.6024804Z D=5120, 2025-05-07T20:32:14.6025072Z scale_ub=None, 2025-05-07T20:32:14.6025381Z contiguous=True, 2025-05-07T20:32:14.6025702Z compiled=True, 2025-05-07T20:32:14.6025996Z ) 2025-05-07T20:32:14.6026461Z self = 2025-05-07T20:32:14.6027175Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.6027557Z 2025-05-07T20:32:14.6027671Z @given( 2025-05-07T20:32:14.6028003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6028445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6028852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6029291Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6030011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6030429Z ) 2025-05-07T20:32:14.6030904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6031515Z def test_silu_mul_quant( 2025-05-07T20:32:14.6031852Z self, 2025-05-07T20:32:14.6032128Z T: int, 2025-05-07T20:32:14.6032396Z D: int, 2025-05-07T20:32:14.6032680Z scale_ub: Optional[float], 2025-05-07T20:32:14.6033036Z contiguous: bool, 2025-05-07T20:32:14.6033365Z compiled: bool, 2025-05-07T20:32:14.6033684Z ) -> None: 2025-05-07T20:32:14.6033982Z torch.manual_seed(2025) 2025-05-07T20:32:14.6034329Z 2025-05-07T20:32:14.6034714Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6035215Z 2025-05-07T20:32:14.6035486Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6035900Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6036339Z x = x_sign * x_clamp 2025-05-07T20:32:14.6036694Z x0 = x[:, :D] 2025-05-07T20:32:14.6037003Z x1 = x[:, D:] 2025-05-07T20:32:14.6037299Z 2025-05-07T20:32:14.6037553Z if contiguous: 2025-05-07T20:32:14.6037863Z x0 = x0.contiguous() 
2025-05-07T20:32:14.6038207Z x1 = x1.contiguous() 2025-05-07T20:32:14.6038540Z 2025-05-07T20:32:14.6038798Z if scale_ub is not None: 2025-05-07T20:32:14.6039174Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6039666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6040120Z ) 2025-05-07T20:32:14.6040379Z else: 2025-05-07T20:32:14.6040661Z scale_ub_tensor = None 2025-05-07T20:32:14.6041003Z 2025-05-07T20:32:14.6041312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6041853Z op = silu_mul_quant 2025-05-07T20:32:14.6042194Z if compiled: 2025-05-07T20:32:14.6042550Z op = torch.compile(op) 2025-05-07T20:32:14.6042976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6043382Z 2025-05-07T20:32:14.6043649Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6044047Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6044477Z 2025-05-07T20:32:14.6044806Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6045280Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6045707Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6046162Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6046684Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6047130Z 2025-05-07T20:32:14.6047413Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.6047704Z 2025-05-07T20:32:14.6047852Z moe/activation_test.py:126: 2025-05-07T20:32:14.6048268Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6048763Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6049235Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6050417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6051516Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6052295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6053299Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6054300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6055378Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6056593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.6057610Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6058527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6059330Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6060190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6060959Z fn() 2025-05-07T20:32:14.6061646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6062508Z self.fn.run( 2025-05-07T20:32:14.6063202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6063967Z kernel = self.compile( 2025-05-07T20:32:14.6064744Z 
2025-05-07T20:32:14.6064744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:14.6065729Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:14.6066237Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:14.6066531Z 
2025-05-07T20:32:14.6066824Z self = 
2025-05-07T20:32:14.6068321Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:14.6070219Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f214764c0>}
2025-05-07T20:32:14.6072056Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:14.6073392Z context = 
2025-05-07T20:32:14.6073768Z 
2025-05-07T20:32:14.6073970Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:14.6074644Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:14.6075229Z module_map=module_map)
2025-05-07T20:32:14.6075664Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.6076109Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:14.6076440Z E ^
2025-05-07T20:32:14.6077012Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.6077604Z 
2025-05-07T20:32:14.6078131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
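Every failure in this run has the same root cause: Triton refuses to lower the fp8e4nv (e4m3) element type on this machine's GPU, and both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant request it. A guard along the following lines would skip the fp8 tests on such devices. This is a minimal sketch, assuming fp8e4nv requires compute capability 8.9 or newer; the helper name and skip message are illustrative, not FBGEMM's actual API:

    import unittest

    import torch

    def device_supports_fp8e4nv() -> bool:
        # Assumption: Triton lowers fp8e4nv (e4m3) only on SM 8.9+ (Ada, Hopper);
        # older parts only get fp8e4b15 and fp8e5, per the ValueError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the failing test, e.g.:
    # @unittest.skipUnless(device_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...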
[Hypothesis then tried ten further examples; every one raised the identical CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") from compiler.py:100, so the repeated test source and tracebacks are abridged here to the varied parameters and the first kernel that failed to compile:
  T=2048,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False -> _fbgemm_silu_mul_quant
  T=2048,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True  -> _kernel_quantize_fp8_row
  T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
  T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=True  -> _kernel_quantize_fp8_row
  T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
  T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
  T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=True  -> _kernel_quantize_fp8_row
  T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
  T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False -> _fbgemm_silu_mul_quant
  T=1,     D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> _kernel_quantize_fp8_row
In every compiled=False example the error surfaced directly in fn() at moe/activation_test.py:117, via fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant); in every compiled=True example it surfaced in ref_fn() at moe/activation_test.py:126, via fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370 (triton_quantize_fp8_row).]
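To replay the minimal falsifying example locally, the decorator that Hypothesis itself suggests in the Falsifying-example output above can be applied directly. A sketch; the version string and payload are copied verbatim from the log, and the body and @settings stay exactly as in the listing above:

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...  # unchanged test body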
y_scale_ref = ref_fn() 2025-05-07T20:32:14.6481229Z 2025-05-07T20:32:14.6481333Z moe/activation_test.py:126: 2025-05-07T20:32:14.6481464Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6481568Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6481713Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6482321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6482421Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6483171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6483449Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6483850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6484119Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6484548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.6484821Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6485223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6485400Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6485772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6485845Z fn() 2025-05-07T20:32:14.6486285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6486367Z self.fn.run( 2025-05-07T20:32:14.6486728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6486826Z kernel = self.compile( 2025-05-07T20:32:14.6487234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6487418Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6487548Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6487553Z 2025-05-07T20:32:14.6487765Z self = 2025-05-07T20:32:14.6488774Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6489411Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
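For reference while reading these failures: silu_mul_quant fuses y = silu(x0) * x1 = x0 * sigmoid(x0) * x1 with row-wise fp8 quantization, returning the fp8 tensor plus one float32 scale per row, which the test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None]. A minimal eager-mode sketch of that contract (an illustration only: the helper name is hypothetical, scale_ub is assumed to cap the per-row max, and torch.float8_e4m3fn requires PyTorch >= 2.1):

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def silu_mul_quant_eager(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in float32, matching the test's ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # One dequantization scale per row: row_max / FP8_MAX.
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        # Scale into fp8 range; clamp in case scale_ub shrank the scale.
        y_scaled = torch.clamp(y / scale[:, None], -FP8_MAX, FP8_MAX)
        return y_scaled.to(torch.float8_e4m3fn), scale

Dequantizing with y_fp8.to(torch.float32) * scale[:, None] then recovers y to fp8 precision, which is what the test compares against.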
at 0x7f1ee021f700>} 2025-05-07T20:32:14.6490360Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6490575Z context = 2025-05-07T20:32:14.6490580Z 2025-05-07T20:32:14.6490769Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6491083Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6491199Z module_map=module_map) 2025-05-07T20:32:14.6491385Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6491497Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6491575Z E ^ 2025-05-07T20:32:14.6492014Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6492019Z 2025-05-07T20:32:14.6492525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6492530Z 2025-05-07T20:32:14.6492647Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6492906Z self=, 2025-05-07T20:32:14.6492984Z T=2048, 2025-05-07T20:32:14.6493066Z D=5120, 2025-05-07T20:32:14.6493148Z scale_ub=None, 2025-05-07T20:32:14.6493353Z contiguous=True, 2025-05-07T20:32:14.6493443Z compiled=True, 2025-05-07T20:32:14.6493518Z ) 2025-05-07T20:32:14.6493770Z self = 2025-05-07T20:32:14.6493973Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.6493978Z 2025-05-07T20:32:14.6494055Z @given( 2025-05-07T20:32:14.6494189Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6494291Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6494414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6494545Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6494664Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6494740Z ) 2025-05-07T20:32:14.6495040Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6495137Z def test_silu_mul_quant( 2025-05-07T20:32:14.6495219Z self, 2025-05-07T20:32:14.6495302Z T: int, 2025-05-07T20:32:14.6495380Z D: int, 2025-05-07T20:32:14.6495488Z scale_ub: Optional[float], 2025-05-07T20:32:14.6495580Z contiguous: bool, 2025-05-07T20:32:14.6495673Z compiled: bool, 2025-05-07T20:32:14.6495758Z ) -> None: 2025-05-07T20:32:14.6495857Z torch.manual_seed(2025) 2025-05-07T20:32:14.6495931Z 2025-05-07T20:32:14.6496124Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6496198Z 2025-05-07T20:32:14.6496292Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6496437Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6496527Z x = x_sign * x_clamp 2025-05-07T20:32:14.6496608Z x0 = x[:, :D] 2025-05-07T20:32:14.6496693Z x1 = x[:, D:] 2025-05-07T20:32:14.6496766Z 2025-05-07T20:32:14.6496854Z if contiguous: 2025-05-07T20:32:14.6496951Z x0 = x0.contiguous() 2025-05-07T20:32:14.6497042Z x1 = x1.contiguous() 2025-05-07T20:32:14.6497132Z 2025-05-07T20:32:14.6497230Z if scale_ub is not None: 2025-05-07T20:32:14.6497339Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6497574Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6497647Z ) 2025-05-07T20:32:14.6497720Z else: 2025-05-07T20:32:14.6497818Z scale_ub_tensor = None 2025-05-07T20:32:14.6497893Z 2025-05-07T20:32:14.6498024Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6498119Z op = silu_mul_quant 2025-05-07T20:32:14.6498202Z if compiled: 
2025-05-07T20:32:14.6498307Z op = torch.compile(op) 2025-05-07T20:32:14.6498412Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6498484Z 2025-05-07T20:32:14.6498578Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6498698Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6498775Z 2025-05-07T20:32:14.6498915Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6499016Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6499114Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6499254Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6499394Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6499467Z 2025-05-07T20:32:14.6499575Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.6499579Z 2025-05-07T20:32:14.6499676Z moe/activation_test.py:126: 2025-05-07T20:32:14.6499811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6499918Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6500054Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6500670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6500877Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6501262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6501504Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6501897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6502170Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6502599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.6502865Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6503271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6503447Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6503816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6503898Z fn() 2025-05-07T20:32:14.6504331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6504419Z self.fn.run( 2025-05-07T20:32:14.6504778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6504868Z kernel = self.compile( 2025-05-07T20:32:14.6505282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6505461Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6505596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6505605Z 2025-05-07T20:32:14.6505816Z self = 2025-05-07T20:32:14.6506769Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:14.6507323Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7dcbe790>} 2025-05-07T20:32:14.6508135Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6508337Z context = 2025-05-07T20:32:14.6508341Z 2025-05-07T20:32:14.6508507Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6508793Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6508907Z module_map=module_map) 2025-05-07T20:32:14.6509070Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6509175Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6509252Z E ^ 2025-05-07T20:32:14.6509635Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6509639Z 2025-05-07T20:32:14.6510185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6510190Z 2025-05-07T20:32:14.6510290Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6510529Z self=, 2025-05-07T20:32:14.6510606Z T=128, 2025-05-07T20:32:14.6510771Z D=5120, 2025-05-07T20:32:14.6510860Z scale_ub=None, 2025-05-07T20:32:14.6510944Z contiguous=True, 2025-05-07T20:32:14.6511026Z compiled=True, 2025-05-07T20:32:14.6511101Z ) 2025-05-07T20:32:14.6511333Z self = 2025-05-07T20:32:14.6511505Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.6511519Z 2025-05-07T20:32:14.6511596Z @given( 2025-05-07T20:32:14.6511715Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6511816Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6511928Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6512043Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6512158Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6512232Z ) 2025-05-07T20:32:14.6512493Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6512596Z def test_silu_mul_quant( 2025-05-07T20:32:14.6512674Z self, 2025-05-07T20:32:14.6512750Z T: int, 2025-05-07T20:32:14.6512827Z D: int, 2025-05-07T20:32:14.6512926Z scale_ub: Optional[float], 2025-05-07T20:32:14.6513016Z contiguous: bool, 2025-05-07T20:32:14.6513096Z compiled: bool, 2025-05-07T20:32:14.6513174Z ) -> None: 2025-05-07T20:32:14.6513272Z torch.manual_seed(2025) 2025-05-07T20:32:14.6513341Z 2025-05-07T20:32:14.6513512Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6513585Z 2025-05-07T20:32:14.6513674Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6513794Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6513886Z x = x_sign * x_clamp 2025-05-07T20:32:14.6513966Z x0 = x[:, :D] 2025-05-07T20:32:14.6514044Z x1 = x[:, D:] 2025-05-07T20:32:14.6514119Z 2025-05-07T20:32:14.6514201Z if contiguous: 2025-05-07T20:32:14.6514299Z x0 = x0.contiguous() 2025-05-07T20:32:14.6514386Z x1 = x1.contiguous() 2025-05-07T20:32:14.6514460Z 2025-05-07T20:32:14.6514553Z if scale_ub is not None: 2025-05-07T20:32:14.6514743Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6514880Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6514958Z ) 2025-05-07T20:32:14.6515032Z else: 2025-05-07T20:32:14.6515125Z scale_ub_tensor = None 2025-05-07T20:32:14.6515202Z 2025-05-07T20:32:14.6515332Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:14.6515420Z op = silu_mul_quant 2025-05-07T20:32:14.6515509Z if compiled: 2025-05-07T20:32:14.6515607Z op = torch.compile(op) 2025-05-07T20:32:14.6515720Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6515789Z 2025-05-07T20:32:14.6515878Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6516008Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6516081Z 2025-05-07T20:32:14.6516217Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6516328Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6516426Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6516549Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6516696Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6516766Z 2025-05-07T20:32:14.6516868Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.6516879Z 2025-05-07T20:32:14.6516975Z moe/activation_test.py:126: 2025-05-07T20:32:14.6517106Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6517214Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6517348Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6517956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6518222Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6518612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6518850Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6519243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6519510Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6519944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.6520209Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6520617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6520792Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6521158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6521242Z fn() 2025-05-07T20:32:14.6521672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6521753Z self.fn.run( 2025-05-07T20:32:14.6522114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6522206Z kernel = self.compile( 2025-05-07T20:32:14.6522614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6522797Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6522930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6522935Z 2025-05-07T20:32:14.6523150Z self = 2025-05-07T20:32:14.6524083Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6524638Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7db5ea60>} 2025-05-07T20:32:14.6525448Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6525644Z context = 2025-05-07T20:32:14.6525652Z 2025-05-07T20:32:14.6525827Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6526106Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6526221Z module_map=module_map) 2025-05-07T20:32:14.6526383Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6526484Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6526568Z E ^ 2025-05-07T20:32:14.6526951Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6526956Z 2025-05-07T20:32:14.6527400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6527409Z 2025-05-07T20:32:14.6527508Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6527739Z self=, 2025-05-07T20:32:14.6527899Z T=4096, 2025-05-07T20:32:14.6527973Z D=5120, 2025-05-07T20:32:14.6528054Z scale_ub=None, 2025-05-07T20:32:14.6528145Z contiguous=True, 2025-05-07T20:32:14.6528232Z compiled=True, 2025-05-07T20:32:14.6528302Z ) 2025-05-07T20:32:14.6528532Z self = 2025-05-07T20:32:14.6528708Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.6528713Z 2025-05-07T20:32:14.6528794Z @given( 2025-05-07T20:32:14.6528911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6529010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6529130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6529245Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6529357Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6529435Z ) 2025-05-07T20:32:14.6529700Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6529793Z def test_silu_mul_quant( 2025-05-07T20:32:14.6529871Z self, 2025-05-07T20:32:14.6529953Z T: int, 2025-05-07T20:32:14.6530026Z D: int, 2025-05-07T20:32:14.6530129Z scale_ub: Optional[float], 2025-05-07T20:32:14.6530216Z contiguous: bool, 2025-05-07T20:32:14.6530301Z compiled: bool, 2025-05-07T20:32:14.6530374Z ) -> None: 2025-05-07T20:32:14.6530466Z torch.manual_seed(2025) 2025-05-07T20:32:14.6530539Z 2025-05-07T20:32:14.6530712Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6530783Z 2025-05-07T20:32:14.6530876Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6530999Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6531087Z x = x_sign * x_clamp 2025-05-07T20:32:14.6531178Z x0 = x[:, :D] 2025-05-07T20:32:14.6531263Z x1 = x[:, D:] 2025-05-07T20:32:14.6531334Z 2025-05-07T20:32:14.6531420Z if contiguous: 2025-05-07T20:32:14.6531513Z x0 = x0.contiguous() 2025-05-07T20:32:14.6531599Z x1 = x1.contiguous() 2025-05-07T20:32:14.6531756Z 2025-05-07T20:32:14.6531849Z if scale_ub is not None: 2025-05-07T20:32:14.6531958Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6532090Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6532163Z ) 2025-05-07T20:32:14.6532243Z else: 2025-05-07T20:32:14.6532334Z scale_ub_tensor 
= None 2025-05-07T20:32:14.6532402Z 2025-05-07T20:32:14.6532534Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6532623Z op = silu_mul_quant 2025-05-07T20:32:14.6532705Z if compiled: 2025-05-07T20:32:14.6532813Z op = torch.compile(op) 2025-05-07T20:32:14.6532917Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6532989Z 2025-05-07T20:32:14.6533083Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6533201Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6533279Z 2025-05-07T20:32:14.6533417Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6533517Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6533621Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6533739Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6533879Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6533956Z 2025-05-07T20:32:14.6534052Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.6534057Z 2025-05-07T20:32:14.6534153Z moe/activation_test.py:126: 2025-05-07T20:32:14.6534289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6534391Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6534638Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6535252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6535351Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6535740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6535971Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6536368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6536634Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6537063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.6537333Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6537740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6537914Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6538288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6538365Z fn() 2025-05-07T20:32:14.6538797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6538878Z self.fn.run( 2025-05-07T20:32:14.6539238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6539338Z kernel = self.compile( 2025-05-07T20:32:14.6539744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6539930Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6540059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6540063Z 2025-05-07T20:32:14.6540356Z self = 2025-05-07T20:32:14.6541212Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6541758Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7d2f1700>} 2025-05-07T20:32:14.6542576Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6542772Z context = 2025-05-07T20:32:14.6542777Z 2025-05-07T20:32:14.6542947Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6543226Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6543333Z module_map=module_map) 2025-05-07T20:32:14.6543499Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6543598Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6543673Z E ^ 2025-05-07T20:32:14.6544055Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6544060Z 2025-05-07T20:32:14.6544508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6544513Z 2025-05-07T20:32:14.6544701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6544931Z self=, 2025-05-07T20:32:14.6545007Z T=16384, 2025-05-07T20:32:14.6545092Z D=5120, 2025-05-07T20:32:14.6545171Z scale_ub=None, 2025-05-07T20:32:14.6545255Z contiguous=True, 2025-05-07T20:32:14.6545341Z compiled=True, 2025-05-07T20:32:14.6545412Z ) 2025-05-07T20:32:14.6545637Z self = 2025-05-07T20:32:14.6545816Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.6545821Z 2025-05-07T20:32:14.6545897Z @given( 2025-05-07T20:32:14.6546016Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6546124Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6546239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6546359Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6546477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6546548Z ) 2025-05-07T20:32:14.6546808Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6546903Z def test_silu_mul_quant( 2025-05-07T20:32:14.6546981Z self, 2025-05-07T20:32:14.6547058Z T: int, 2025-05-07T20:32:14.6547132Z D: int, 2025-05-07T20:32:14.6547227Z scale_ub: Optional[float], 2025-05-07T20:32:14.6547317Z contiguous: bool, 2025-05-07T20:32:14.6547401Z compiled: bool, 2025-05-07T20:32:14.6547480Z ) -> None: 2025-05-07T20:32:14.6547573Z torch.manual_seed(2025) 2025-05-07T20:32:14.6547644Z 2025-05-07T20:32:14.6547823Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6547894Z 2025-05-07T20:32:14.6547984Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6548113Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6548206Z x = x_sign * x_clamp 2025-05-07T20:32:14.6548286Z x0 = x[:, :D] 2025-05-07T20:32:14.6552217Z x1 = x[:, D:] 2025-05-07T20:32:14.6552305Z 2025-05-07T20:32:14.6552400Z if contiguous: 2025-05-07T20:32:14.6552606Z x0 = x0.contiguous() 2025-05-07T20:32:14.6552702Z x1 = x1.contiguous() 2025-05-07T20:32:14.6552775Z 2025-05-07T20:32:14.6552866Z if scale_ub is not None: 2025-05-07T20:32:14.6552978Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6553115Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:14.6553187Z ) 2025-05-07T20:32:14.6553266Z else: 2025-05-07T20:32:14.6553360Z scale_ub_tensor = None 2025-05-07T20:32:14.6553429Z 2025-05-07T20:32:14.6553565Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6553657Z op = silu_mul_quant 2025-05-07T20:32:14.6553745Z if compiled: 2025-05-07T20:32:14.6553850Z op = torch.compile(op) 2025-05-07T20:32:14.6553959Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6554039Z 2025-05-07T20:32:14.6554132Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6554258Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6554335Z 2025-05-07T20:32:14.6554475Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6554576Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6554680Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6554803Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6554942Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6555016Z 2025-05-07T20:32:14.6555116Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.6555121Z 2025-05-07T20:32:14.6555220Z moe/activation_test.py:126: 2025-05-07T20:32:14.6555355Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6555542Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6555682Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6556306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6556407Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6556801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6557034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6557431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6557697Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6558126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.6558399Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6558804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6558981Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6559343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6559421Z fn() 2025-05-07T20:32:14.6559855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6559937Z self.fn.run( 2025-05-07T20:32:14.6560293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6560392Z kernel = self.compile( 2025-05-07T20:32:14.6560798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6560982Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6561187Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:14.6561193Z 2025-05-07T20:32:14.6561404Z self = 2025-05-07T20:32:14.6562256Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6562802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7db55700>} 2025-05-07T20:32:14.6563623Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6563823Z context = 2025-05-07T20:32:14.6563831Z 2025-05-07T20:32:14.6564001Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6564283Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6564386Z module_map=module_map) 2025-05-07T20:32:14.6564554Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6564657Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6564738Z E ^ 2025-05-07T20:32:14.6565124Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6565129Z 2025-05-07T20:32:14.6565574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6565656Z 2025-05-07T20:32:14.6565762Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6565999Z self=, 2025-05-07T20:32:14.6566077Z T=1, 2025-05-07T20:32:14.6566157Z D=5120, 2025-05-07T20:32:14.6566236Z scale_ub=1200.0, 2025-05-07T20:32:14.6566316Z contiguous=True, 2025-05-07T20:32:14.6566399Z compiled=True, 2025-05-07T20:32:14.6566469Z ) 2025-05-07T20:32:14.6566692Z self = 2025-05-07T20:32:14.6566868Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.6566873Z 2025-05-07T20:32:14.6566948Z @given( 2025-05-07T20:32:14.6567068Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6567163Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6567278Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6567396Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6567511Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6567585Z ) 2025-05-07T20:32:14.6567849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6567940Z def test_silu_mul_quant( 2025-05-07T20:32:14.6568013Z self, 2025-05-07T20:32:14.6568092Z T: int, 2025-05-07T20:32:14.6568167Z D: int, 2025-05-07T20:32:14.6568263Z scale_ub: Optional[float], 2025-05-07T20:32:14.6568350Z contiguous: bool, 2025-05-07T20:32:14.6568433Z compiled: bool, 2025-05-07T20:32:14.6568507Z ) -> None: 2025-05-07T20:32:14.6568603Z torch.manual_seed(2025) 2025-05-07T20:32:14.6568675Z 2025-05-07T20:32:14.6568847Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6568923Z 2025-05-07T20:32:14.6569015Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6569150Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6569236Z x = x_sign * x_clamp 2025-05-07T20:32:14.6569314Z x0 = x[:, :D] 2025-05-07T20:32:14.6569399Z x1 = x[:, D:] 2025-05-07T20:32:14.6569577Z 2025-05-07T20:32:14.6569660Z if contiguous: 2025-05-07T20:32:14.6569755Z x0 = x0.contiguous() 2025-05-07T20:32:14.6569842Z x1 = x1.contiguous() 2025-05-07T20:32:14.6569914Z 2025-05-07T20:32:14.6570004Z if scale_ub is not None: 2025-05-07T20:32:14.6570106Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:14.6570243Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6570319Z ) 2025-05-07T20:32:14.6570392Z else: 2025-05-07T20:32:14.6570487Z scale_ub_tensor = None 2025-05-07T20:32:14.6570559Z 2025-05-07T20:32:14.6570685Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6570779Z op = silu_mul_quant 2025-05-07T20:32:14.6570867Z if compiled: 2025-05-07T20:32:14.6570965Z op = torch.compile(op) 2025-05-07T20:32:14.6571075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6571154Z 2025-05-07T20:32:14.6571243Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6571248Z 2025-05-07T20:32:14.6571346Z moe/activation_test.py:117: 2025-05-07T20:32:14.6571476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6571578Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6571674Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6572067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.6572160Z return fn(*args, **kwargs) 2025-05-07T20:32:14.6572694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6572872Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6573254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6573492Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6573855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6573947Z kernel = self.compile( 2025-05-07T20:32:14.6574356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6574535Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6574662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6574667Z 2025-05-07T20:32:14.6574874Z self = 2025-05-07T20:32:14.6575727Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6576281Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7dabf5e0>} 2025-05-07T20:32:14.6577093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6577290Z context = 2025-05-07T20:32:14.6577295Z 2025-05-07T20:32:14.6577465Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6577740Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6577849Z module_map=module_map) 2025-05-07T20:32:14.6578011Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6578106Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6578256Z E ^ 2025-05-07T20:32:14.6578643Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6578648Z 2025-05-07T20:32:14.6579091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6579096Z 2025-05-07T20:32:14.6579199Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6579428Z self=, 2025-05-07T20:32:14.6579501Z T=1, 2025-05-07T20:32:14.6579574Z D=5120, 2025-05-07T20:32:14.6579655Z scale_ub=None, 2025-05-07T20:32:14.6579740Z contiguous=False, 2025-05-07T20:32:14.6579824Z compiled=True, 2025-05-07T20:32:14.6579899Z ) 2025-05-07T20:32:14.6580136Z self = 2025-05-07T20:32:14.6580305Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.6580313Z 2025-05-07T20:32:14.6580388Z @given( 2025-05-07T20:32:14.6580514Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6580610Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6580723Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6580841Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6580955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6581027Z ) 2025-05-07T20:32:14.6581289Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6581380Z def test_silu_mul_quant( 2025-05-07T20:32:14.6581462Z self, 2025-05-07T20:32:14.6581533Z T: int, 2025-05-07T20:32:14.6581690Z D: int, 2025-05-07T20:32:14.6581790Z scale_ub: Optional[float], 2025-05-07T20:32:14.6581876Z contiguous: bool, 2025-05-07T20:32:14.6581958Z compiled: bool, 2025-05-07T20:32:14.6582037Z ) -> None: 2025-05-07T20:32:14.6582132Z torch.manual_seed(2025) 2025-05-07T20:32:14.6582203Z 2025-05-07T20:32:14.6582377Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6582447Z 2025-05-07T20:32:14.6582535Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6582659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6582945Z x = x_sign * x_clamp 2025-05-07T20:32:14.6583079Z x0 = x[:, :D] 2025-05-07T20:32:14.6583185Z x1 = x[:, D:] 2025-05-07T20:32:14.6583258Z 2025-05-07T20:32:14.6583347Z if contiguous: 2025-05-07T20:32:14.6583434Z x0 = x0.contiguous() 2025-05-07T20:32:14.6583521Z x1 = x1.contiguous() 2025-05-07T20:32:14.6583597Z 2025-05-07T20:32:14.6583691Z if scale_ub is not None: 2025-05-07T20:32:14.6583793Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6583931Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6584007Z ) 2025-05-07T20:32:14.6584085Z else: 2025-05-07T20:32:14.6584180Z scale_ub_tensor = None 2025-05-07T20:32:14.6584250Z 2025-05-07T20:32:14.6584382Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6584469Z op = silu_mul_quant 2025-05-07T20:32:14.6584549Z if compiled: 2025-05-07T20:32:14.6584646Z op = torch.compile(op) 2025-05-07T20:32:14.6584750Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6584821Z 2025-05-07T20:32:14.6584914Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6585034Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6585104Z 2025-05-07T20:32:14.6585244Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6585350Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6585450Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6585730Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6585877Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6585953Z 2025-05-07T20:32:14.6586052Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:14.6586056Z 2025-05-07T20:32:14.6586152Z moe/activation_test.py:126: 2025-05-07T20:32:14.6586283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6586385Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6586518Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6587128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6587224Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6587615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6587848Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6588238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6588511Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6588936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.6589201Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6589600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6589822Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6590309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6590385Z fn() 2025-05-07T20:32:14.6590816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6590902Z self.fn.run( 2025-05-07T20:32:14.6591258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6591352Z kernel = self.compile( 2025-05-07T20:32:14.6591755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6591932Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6592063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6592067Z 2025-05-07T20:32:14.6592277Z self = 2025-05-07T20:32:14.6593141Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6593689Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1d7cdb9af0>} 2025-05-07T20:32:14.6594496Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6594693Z context = 2025-05-07T20:32:14.6594698Z 2025-05-07T20:32:14.6594863Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6595141Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6595250Z module_map=module_map) 2025-05-07T20:32:14.6595491Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6595596Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6595672Z E ^ 2025-05-07T20:32:14.6596053Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6596060Z 2025-05-07T20:32:14.6596502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6596507Z 2025-05-07T20:32:14.6596606Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6596837Z self=, 2025-05-07T20:32:14.6596908Z T=1, 2025-05-07T20:32:14.6596979Z D=5120, 2025-05-07T20:32:14.6597059Z scale_ub=None, 2025-05-07T20:32:14.6597146Z contiguous=True, 2025-05-07T20:32:14.6597228Z compiled=False, 2025-05-07T20:32:14.6597307Z ) 2025-05-07T20:32:14.6597534Z self = 2025-05-07T20:32:14.6597701Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:14.6597706Z 2025-05-07T20:32:14.6597780Z @given( 2025-05-07T20:32:14.6597896Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6597997Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6598111Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6598225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6598341Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6598414Z ) 2025-05-07T20:32:14.6598670Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6598763Z def test_silu_mul_quant( 2025-05-07T20:32:14.6598922Z self, 2025-05-07T20:32:14.6599000Z T: int, 2025-05-07T20:32:14.6599072Z D: int, 2025-05-07T20:32:14.6599170Z scale_ub: Optional[float], 2025-05-07T20:32:14.6599261Z contiguous: bool, 2025-05-07T20:32:14.6599346Z compiled: bool, 2025-05-07T20:32:14.6599419Z ) -> None: 2025-05-07T20:32:14.6599517Z torch.manual_seed(2025) 2025-05-07T20:32:14.6599591Z 2025-05-07T20:32:14.6599760Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6599840Z 2025-05-07T20:32:14.6599930Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6600051Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6600142Z x = x_sign * x_clamp 2025-05-07T20:32:14.6600219Z x0 = x[:, :D] 2025-05-07T20:32:14.6600298Z x1 = x[:, D:] 2025-05-07T20:32:14.6600368Z 2025-05-07T20:32:14.6600451Z if contiguous: 2025-05-07T20:32:14.6600545Z x0 = x0.contiguous() 2025-05-07T20:32:14.6600640Z x1 = x1.contiguous() 2025-05-07T20:32:14.6600712Z 2025-05-07T20:32:14.6600803Z if scale_ub is not None: 2025-05-07T20:32:14.6600905Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6601042Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6601127Z ) 2025-05-07T20:32:14.6601202Z else: 2025-05-07T20:32:14.6601293Z scale_ub_tensor = None 2025-05-07T20:32:14.6601373Z 2025-05-07T20:32:14.6601501Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6601591Z op = silu_mul_quant 2025-05-07T20:32:14.6601672Z if compiled: 2025-05-07T20:32:14.6601771Z op 
= torch.compile(op) 2025-05-07T20:32:14.6601876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6601945Z 2025-05-07T20:32:14.6602031Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6602035Z 2025-05-07T20:32:14.6602136Z moe/activation_test.py:117: 2025-05-07T20:32:14.6602271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6602367Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6602573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6603113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6603213Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6603596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6603827Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6604198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6604290Z kernel = self.compile( 2025-05-07T20:32:14.6604697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6604882Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6605014Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6605018Z 2025-05-07T20:32:14.6605231Z self = 2025-05-07T20:32:14.6606074Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6606624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7d2f18b0>} 2025-05-07T20:32:14.6607434Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6607705Z context = 2025-05-07T20:32:14.6607710Z 2025-05-07T20:32:14.6607889Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6608163Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6608274Z module_map=module_map) 2025-05-07T20:32:14.6608434Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6608530Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6608603Z E ^ 2025-05-07T20:32:14.6608981Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6608986Z 2025-05-07T20:32:14.6609428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6609445Z 2025-05-07T20:32:14.6609544Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6609772Z self=, 2025-05-07T20:32:14.6609855Z T=128, 2025-05-07T20:32:14.6609933Z D=5120, 2025-05-07T20:32:14.6610010Z scale_ub=None, 2025-05-07T20:32:14.6610094Z contiguous=False, 2025-05-07T20:32:14.6610173Z compiled=True, 2025-05-07T20:32:14.6610246Z ) 2025-05-07T20:32:14.6610472Z self = 2025-05-07T20:32:14.6610644Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.6610649Z 2025-05-07T20:32:14.6610723Z @given( 2025-05-07T20:32:14.6610841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6610939Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6611062Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6611177Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6611294Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6611367Z ) 2025-05-07T20:32:14.6611706Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6611801Z def test_silu_mul_quant( 2025-05-07T20:32:14.6611883Z self, 2025-05-07T20:32:14.6611957Z T: int, 2025-05-07T20:32:14.6612030Z D: int, 2025-05-07T20:32:14.6612128Z scale_ub: Optional[float], 2025-05-07T20:32:14.6612212Z contiguous: bool, 2025-05-07T20:32:14.6612300Z compiled: bool, 2025-05-07T20:32:14.6612374Z ) -> None: 2025-05-07T20:32:14.6612463Z torch.manual_seed(2025) 2025-05-07T20:32:14.6612539Z 2025-05-07T20:32:14.6612709Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6612782Z 2025-05-07T20:32:14.6612878Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6613003Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6613092Z x = x_sign * x_clamp 2025-05-07T20:32:14.6613177Z x0 = x[:, :D] 2025-05-07T20:32:14.6613256Z x1 = x[:, D:] 2025-05-07T20:32:14.6613328Z 2025-05-07T20:32:14.6613425Z if contiguous: 2025-05-07T20:32:14.6613514Z x0 = x0.contiguous() 2025-05-07T20:32:14.6613601Z x1 = x1.contiguous() 2025-05-07T20:32:14.6613677Z 2025-05-07T20:32:14.6613764Z if scale_ub is not None: 2025-05-07T20:32:14.6613869Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6614005Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6614077Z ) 2025-05-07T20:32:14.6614154Z else: 2025-05-07T20:32:14.6614247Z scale_ub_tensor = None 2025-05-07T20:32:14.6614327Z 2025-05-07T20:32:14.6614457Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6614554Z op = silu_mul_quant 2025-05-07T20:32:14.6614634Z if compiled: 2025-05-07T20:32:14.6614819Z op = torch.compile(op) 2025-05-07T20:32:14.6614925Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6614997Z 2025-05-07T20:32:14.6615093Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6615098Z 2025-05-07T20:32:14.6615192Z moe/activation_test.py:117: 2025-05-07T20:32:14.6615323Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6615425Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6615522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6615911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.6616006Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.6616539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6616634Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6617018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6617254Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6617619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6617708Z kernel = self.compile( 2025-05-07T20:32:14.6618115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6618297Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6618425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6618429Z 2025-05-07T20:32:14.6618640Z self = 2025-05-07T20:32:14.6619484Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6620124Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7cd1ee50>} 2025-05-07T20:32:14.6620936Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6621134Z context = 2025-05-07T20:32:14.6621138Z 2025-05-07T20:32:14.6621309Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6621579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6621691Z module_map=module_map) 2025-05-07T20:32:14.6621855Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6621952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6622030Z E ^ 2025-05-07T20:32:14.6622416Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6622421Z 2025-05-07T20:32:14.6622863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6622867Z 2025-05-07T20:32:14.6622970Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6623198Z self=, 2025-05-07T20:32:14.6623276Z T=128, 2025-05-07T20:32:14.6623351Z D=7168, 2025-05-07T20:32:14.6623429Z scale_ub=1200.0, 2025-05-07T20:32:14.6623517Z contiguous=False, 2025-05-07T20:32:14.6623598Z compiled=False, 2025-05-07T20:32:14.6623668Z ) 2025-05-07T20:32:14.6623976Z self = 2025-05-07T20:32:14.6624151Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.6624156Z 2025-05-07T20:32:14.6624235Z @given( 2025-05-07T20:32:14.6624352Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6624448Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6624570Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6624683Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6624794Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6624870Z ) 2025-05-07T20:32:14.6625125Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6625216Z def test_silu_mul_quant( 2025-05-07T20:32:14.6625296Z self, 2025-05-07T20:32:14.6625371Z T: int, 2025-05-07T20:32:14.6625444Z D: int, 2025-05-07T20:32:14.6625550Z scale_ub: Optional[float], 2025-05-07T20:32:14.6625639Z contiguous: bool, 2025-05-07T20:32:14.6625721Z compiled: bool, 2025-05-07T20:32:14.6625800Z ) -> None: 2025-05-07T20:32:14.6625894Z torch.manual_seed(2025) 2025-05-07T20:32:14.6625967Z 2025-05-07T20:32:14.6626138Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6626209Z 2025-05-07T20:32:14.6626304Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6626424Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6626511Z x = x_sign * x_clamp 2025-05-07T20:32:14.6626594Z x0 = x[:, :D] 2025-05-07T20:32:14.6626674Z x1 = x[:, D:] 2025-05-07T20:32:14.6626744Z 2025-05-07T20:32:14.6626830Z if contiguous: 2025-05-07T20:32:14.6626918Z x0 = x0.contiguous() 2025-05-07T20:32:14.6627006Z x1 = x1.contiguous() 2025-05-07T20:32:14.6627083Z 2025-05-07T20:32:14.6627172Z if scale_ub is not None: 2025-05-07T20:32:14.6627281Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6627414Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6627485Z ) 2025-05-07T20:32:14.6627649Z else: 2025-05-07T20:32:14.6627744Z scale_ub_tensor = None 2025-05-07T20:32:14.6627814Z 2025-05-07T20:32:14.6627943Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6628031Z op = silu_mul_quant 2025-05-07T20:32:14.6628111Z if compiled: 2025-05-07T20:32:14.6628211Z op = torch.compile(op) 2025-05-07T20:32:14.6628315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6628387Z 2025-05-07T20:32:14.6628478Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6628482Z 2025-05-07T20:32:14.6628576Z moe/activation_test.py:117: 2025-05-07T20:32:14.6628708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6628806Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6628907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6629453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6629549Z 
(The next four examples reprint the same test source and fail with the identical CompilationError in _fbgemm_silu_mul_quant; only the parameters differ. For compiled=True runs the traceback additionally passes through torch/_dynamo/eval_frame.py:678 in _fn before reaching silu_mul_quant.)

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

(Same test source as above, but this example gets past fn(); the failure is raised one step later, in the reference path.)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f1d7c188430>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
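For context on what the reference path computes: triton_quantize_fp8_row returns, per row, an FP8 payload plus a float32 scale such that y is approximately y_fp8.float() * y_scale[:, None], which is how the test dequantizes above. A rough eager-mode sketch of that rowwise quantization, assuming float8_e4m3fn with finite max 448.0 (the constant, the clamping, and the helper name are assumptions, not FBGEMM's exact kernel):

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn (assumed target dtype)

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max magnitude, floored to avoid division by zero.
        row_amax = y.abs().amax(dim=1).float().clamp(min=1e-12)
        # Optional upper bound on the row scale (the test's scale_ub_tensor).
        if scale_ub is not None:
            row_amax = torch.minimum(row_amax, scale_ub)
        scale = row_amax / FP8_E4M3_MAX                  # dequantization scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Note that a plain eager cast to torch.float8_e4m3fn works even on sm_86; only Triton code generation for the dtype is rejected, which is why the error surfaces inside the kernels rather than during tensor setup.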
at 0x7f1d7c188430>} 2025-05-07T20:32:14.6709194Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6709390Z context = 2025-05-07T20:32:14.6709394Z 2025-05-07T20:32:14.6709568Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6709921Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6710037Z module_map=module_map) 2025-05-07T20:32:14.6710204Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6710307Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6710382Z E ^ 2025-05-07T20:32:14.6710756Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6710761Z 2025-05-07T20:32:14.6711205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6711213Z 2025-05-07T20:32:14.6711315Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6711544Z self=, 2025-05-07T20:32:14.6711624Z T=1, 2025-05-07T20:32:14.6711700Z D=5120, 2025-05-07T20:32:14.6711782Z scale_ub=1200.0, 2025-05-07T20:32:14.6711952Z contiguous=False, 2025-05-07T20:32:14.6712033Z compiled=True, 2025-05-07T20:32:14.6712104Z ) 2025-05-07T20:32:14.6712330Z self = 2025-05-07T20:32:14.6712503Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.6712507Z 2025-05-07T20:32:14.6712583Z @given( 2025-05-07T20:32:14.6712699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6712799Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6712918Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6713032Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6713142Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6713220Z ) 2025-05-07T20:32:14.6713476Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6713565Z def test_silu_mul_quant( 2025-05-07T20:32:14.6713653Z self, 2025-05-07T20:32:14.6713728Z T: int, 2025-05-07T20:32:14.6713802Z D: int, 2025-05-07T20:32:14.6713902Z scale_ub: Optional[float], 2025-05-07T20:32:14.6713989Z contiguous: bool, 2025-05-07T20:32:14.6714081Z compiled: bool, 2025-05-07T20:32:14.6714159Z ) -> None: 2025-05-07T20:32:14.6714252Z torch.manual_seed(2025) 2025-05-07T20:32:14.6714327Z 2025-05-07T20:32:14.6714497Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6714568Z 2025-05-07T20:32:14.6714658Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6714780Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6714870Z x = x_sign * x_clamp 2025-05-07T20:32:14.6714957Z x0 = x[:, :D] 2025-05-07T20:32:14.6715036Z x1 = x[:, D:] 2025-05-07T20:32:14.6715109Z 2025-05-07T20:32:14.6715196Z if contiguous: 2025-05-07T20:32:14.6715285Z x0 = x0.contiguous() 2025-05-07T20:32:14.6715382Z x1 = x1.contiguous() 2025-05-07T20:32:14.6715454Z 2025-05-07T20:32:14.6715540Z if scale_ub is not None: 2025-05-07T20:32:14.6715647Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6715864Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6715939Z ) 2025-05-07T20:32:14.6716017Z else: 2025-05-07T20:32:14.6716110Z scale_ub_tensor = None 2025-05-07T20:32:14.6716182Z 2025-05-07T20:32:14.6716313Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6716401Z op = silu_mul_quant 2025-05-07T20:32:14.6716485Z if compiled: 
2025-05-07T20:32:14.6716591Z op = torch.compile(op) 2025-05-07T20:32:14.6716694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6716770Z 2025-05-07T20:32:14.6716860Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6716865Z 2025-05-07T20:32:14.6716959Z moe/activation_test.py:117: 2025-05-07T20:32:14.6717093Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6717190Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6717287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6717687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.6717778Z return fn(*args, **kwargs) 2025-05-07T20:32:14.6718314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6718413Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6718794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6719027Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6719391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6719564Z kernel = self.compile( 2025-05-07T20:32:14.6719975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6720158Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6720293Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6720297Z 2025-05-07T20:32:14.6720508Z self = 2025-05-07T20:32:14.6721355Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6721902Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7c188e50>} 2025-05-07T20:32:14.6722718Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6722917Z context = 2025-05-07T20:32:14.6722922Z 2025-05-07T20:32:14.6723089Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6723365Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6723473Z module_map=module_map) 2025-05-07T20:32:14.6723635Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6723735Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6723810Z E ^ 2025-05-07T20:32:14.6724191Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6724201Z 2025-05-07T20:32:14.6724650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6724654Z 2025-05-07T20:32:14.6724835Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6725070Z self=, 2025-05-07T20:32:14.6725145Z T=1, 2025-05-07T20:32:14.6725220Z D=5120, 2025-05-07T20:32:14.6725309Z scale_ub=1200.0, 2025-05-07T20:32:14.6725395Z contiguous=False, 2025-05-07T20:32:14.6725475Z compiled=False, 2025-05-07T20:32:14.6725547Z ) 2025-05-07T20:32:14.6725770Z self = 2025-05-07T20:32:14.6725942Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.6725946Z 2025-05-07T20:32:14.6726024Z @given( 2025-05-07T20:32:14.6726141Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6726248Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6726364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6726477Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6726598Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6726670Z ) 2025-05-07T20:32:14.6726926Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6727019Z def test_silu_mul_quant( 2025-05-07T20:32:14.6727093Z self, 2025-05-07T20:32:14.6727169Z T: int, 2025-05-07T20:32:14.6727250Z D: int, 2025-05-07T20:32:14.6727346Z scale_ub: Optional[float], 2025-05-07T20:32:14.6727430Z contiguous: bool, 2025-05-07T20:32:14.6727518Z compiled: bool, 2025-05-07T20:32:14.6727594Z ) -> None: 2025-05-07T20:32:14.6727689Z torch.manual_seed(2025) 2025-05-07T20:32:14.6727759Z 2025-05-07T20:32:14.6727927Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6728087Z 2025-05-07T20:32:14.6728177Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6728301Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6728397Z x = x_sign * x_clamp 2025-05-07T20:32:14.6728479Z x0 = x[:, :D] 2025-05-07T20:32:14.6728560Z x1 = x[:, D:] 2025-05-07T20:32:14.6728638Z 2025-05-07T20:32:14.6728718Z if contiguous: 2025-05-07T20:32:14.6728806Z x0 = x0.contiguous() 2025-05-07T20:32:14.6728897Z x1 = x1.contiguous() 2025-05-07T20:32:14.6728972Z 2025-05-07T20:32:14.6729060Z if scale_ub is not None: 2025-05-07T20:32:14.6729168Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6729307Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6729387Z ) 2025-05-07T20:32:14.6729464Z else: 2025-05-07T20:32:14.6729556Z scale_ub_tensor = None 2025-05-07T20:32:14.6729635Z 2025-05-07T20:32:14.6729762Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6729850Z op = silu_mul_quant 2025-05-07T20:32:14.6729935Z if compiled: 2025-05-07T20:32:14.6730035Z op = torch.compile(op) 2025-05-07T20:32:14.6730138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6730213Z 2025-05-07T20:32:14.6730303Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6730308Z 2025-05-07T20:32:14.6730407Z moe/activation_test.py:117: 2025-05-07T20:32:14.6730535Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6730634Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6730735Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6731274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6731368Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6731758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6731990Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6732436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6732529Z kernel = self.compile( 2025-05-07T20:32:14.6732935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6733112Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6733238Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6733242Z 2025-05-07T20:32:14.6733453Z self = 2025-05-07T20:32:14.6734303Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6734859Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cefb79820>} 2025-05-07T20:32:14.6735672Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6735865Z context = 2025-05-07T20:32:14.6735870Z 2025-05-07T20:32:14.6736041Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6736313Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6736418Z module_map=module_map) 2025-05-07T20:32:14.6736683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6736780Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6736857Z E ^ 2025-05-07T20:32:14.6737243Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6737248Z 2025-05-07T20:32:14.6737692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6737696Z 2025-05-07T20:32:14.6737800Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6738029Z self=, 2025-05-07T20:32:14.6738105Z T=16384, 2025-05-07T20:32:14.6738183Z D=5120, 2025-05-07T20:32:14.6738262Z scale_ub=1200.0, 2025-05-07T20:32:14.6738348Z contiguous=False, 2025-05-07T20:32:14.6738432Z compiled=True, 2025-05-07T20:32:14.6738505Z ) 2025-05-07T20:32:14.6738737Z self = 2025-05-07T20:32:14.6738919Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.6738923Z 2025-05-07T20:32:14.6739005Z @given( 2025-05-07T20:32:14.6739127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6739224Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6739338Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6739455Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6739566Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6739644Z ) 2025-05-07T20:32:14.6739901Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6739991Z def test_silu_mul_quant( 2025-05-07T20:32:14.6740068Z self, 2025-05-07T20:32:14.6740141Z T: int, 2025-05-07T20:32:14.6740216Z D: int, 2025-05-07T20:32:14.6740333Z scale_ub: Optional[float], 2025-05-07T20:32:14.6740431Z contiguous: bool, 2025-05-07T20:32:14.6740533Z compiled: bool, 2025-05-07T20:32:14.6740621Z ) -> None: 2025-05-07T20:32:14.6740793Z torch.manual_seed(2025) 2025-05-07T20:32:14.6740870Z 2025-05-07T20:32:14.6741047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6741120Z 2025-05-07T20:32:14.6741209Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6741334Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6741424Z x = x_sign * x_clamp 2025-05-07T20:32:14.6741507Z x0 = x[:, :D] 2025-05-07T20:32:14.6741587Z x1 = x[:, D:] 2025-05-07T20:32:14.6741659Z 2025-05-07T20:32:14.6741740Z if contiguous: 2025-05-07T20:32:14.6741831Z x0 = x0.contiguous() 2025-05-07T20:32:14.6741916Z x1 = x1.contiguous() 2025-05-07T20:32:14.6741989Z 2025-05-07T20:32:14.6742079Z if scale_ub is not None: 2025-05-07T20:32:14.6742191Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6742327Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6742403Z ) 2025-05-07T20:32:14.6742491Z else: 2025-05-07T20:32:14.6742589Z scale_ub_tensor = None 2025-05-07T20:32:14.6742658Z 2025-05-07T20:32:14.6742785Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6742877Z op = silu_mul_quant 2025-05-07T20:32:14.6742958Z if compiled: 2025-05-07T20:32:14.6743057Z op = torch.compile(op) 2025-05-07T20:32:14.6743161Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6743234Z 2025-05-07T20:32:14.6743322Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6743326Z 2025-05-07T20:32:14.6743427Z moe/activation_test.py:117: 2025-05-07T20:32:14.6743559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6743658Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6743837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6744228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.6744325Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.6744862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6744958Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6745343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6745574Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6745939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6746030Z kernel = self.compile( 2025-05-07T20:32:14.6746436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6746624Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6746756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6746761Z 2025-05-07T20:32:14.6746970Z self = 2025-05-07T20:32:14.6747820Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6748367Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7c137790>} 2025-05-07T20:32:14.6749183Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6749381Z context = 2025-05-07T20:32:14.6749465Z 2025-05-07T20:32:14.6749636Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6750022Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6750128Z module_map=module_map) 2025-05-07T20:32:14.6750293Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6750399Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6750491Z E ^ 2025-05-07T20:32:14.6750898Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6750903Z 2025-05-07T20:32:14.6751346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6751355Z 2025-05-07T20:32:14.6751459Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6751694Z self=, 2025-05-07T20:32:14.6751771Z T=2048, 2025-05-07T20:32:14.6751849Z D=7168, 2025-05-07T20:32:14.6751929Z scale_ub=1200.0, 2025-05-07T20:32:14.6752013Z contiguous=False, 2025-05-07T20:32:14.6752099Z compiled=True, 2025-05-07T20:32:14.6752168Z ) 2025-05-07T20:32:14.6752393Z self = 2025-05-07T20:32:14.6752572Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.6752577Z 2025-05-07T20:32:14.6752648Z @given( 2025-05-07T20:32:14.6752771Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6752865Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6752978Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6753184Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6753295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6753374Z ) 2025-05-07T20:32:14.6753634Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6753726Z def test_silu_mul_quant( 2025-05-07T20:32:14.6753805Z self, 2025-05-07T20:32:14.6753879Z T: int, 2025-05-07T20:32:14.6753955Z D: int, 2025-05-07T20:32:14.6754054Z scale_ub: Optional[float], 2025-05-07T20:32:14.6754141Z contiguous: bool, 2025-05-07T20:32:14.6754224Z compiled: bool, 2025-05-07T20:32:14.6754303Z ) -> None: 2025-05-07T20:32:14.6754395Z torch.manual_seed(2025) 2025-05-07T20:32:14.6754469Z 2025-05-07T20:32:14.6754643Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6754715Z 2025-05-07T20:32:14.6754803Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6754933Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6755018Z x = x_sign * x_clamp 2025-05-07T20:32:14.6755100Z x0 = x[:, :D] 2025-05-07T20:32:14.6755177Z x1 = x[:, D:] 2025-05-07T20:32:14.6755252Z 2025-05-07T20:32:14.6755337Z if contiguous: 2025-05-07T20:32:14.6755426Z x0 = x0.contiguous() 2025-05-07T20:32:14.6755514Z x1 = x1.contiguous() 2025-05-07T20:32:14.6755587Z 2025-05-07T20:32:14.6755675Z if scale_ub is not None: 2025-05-07T20:32:14.6755777Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6755916Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6755993Z ) 2025-05-07T20:32:14.6756068Z else: 2025-05-07T20:32:14.6756162Z scale_ub_tensor = None 2025-05-07T20:32:14.6756234Z 2025-05-07T20:32:14.6756365Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6756451Z op = silu_mul_quant 2025-05-07T20:32:14.6756539Z if compiled: 2025-05-07T20:32:14.6756642Z op = torch.compile(op) 2025-05-07T20:32:14.6756745Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6756899Z 2025-05-07T20:32:14.6756992Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6756997Z 2025-05-07T20:32:14.6757092Z moe/activation_test.py:117: 2025-05-07T20:32:14.6757222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6757325Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6757424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6757819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.6757911Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.6758448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6758554Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6758936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6759171Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6759535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6759628Z kernel = self.compile( 2025-05-07T20:32:14.6760039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6760216Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6760344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6760349Z 2025-05-07T20:32:14.6760561Z self = 2025-05-07T20:32:14.6761409Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6762122Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cefadf4c0>} 2025-05-07T20:32:14.6762934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6763129Z context = 2025-05-07T20:32:14.6763138Z 2025-05-07T20:32:14.6763304Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6763577Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6763690Z module_map=module_map) 2025-05-07T20:32:14.6763852Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6763951Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6764037Z E ^ 2025-05-07T20:32:14.6764417Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6764802Z 2025-05-07T20:32:14.6765252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6765256Z 2025-05-07T20:32:14.6765357Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6765586Z self=, 2025-05-07T20:32:14.6765671Z T=1, 2025-05-07T20:32:14.6765750Z D=5120, 2025-05-07T20:32:14.6765833Z scale_ub=None, 2025-05-07T20:32:14.6765928Z contiguous=False, 2025-05-07T20:32:14.6766015Z compiled=False, 2025-05-07T20:32:14.6766098Z ) 2025-05-07T20:32:14.6766322Z self = 2025-05-07T20:32:14.6766596Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:14.6766601Z 2025-05-07T20:32:14.6766681Z @given( 2025-05-07T20:32:14.6766798Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6766892Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6767009Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6767124Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6767233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6767311Z ) 2025-05-07T20:32:14.6767567Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6767665Z def test_silu_mul_quant( 2025-05-07T20:32:14.6767743Z self, 2025-05-07T20:32:14.6767819Z T: int, 2025-05-07T20:32:14.6767908Z D: int, 2025-05-07T20:32:14.6768004Z scale_ub: Optional[float], 2025-05-07T20:32:14.6768091Z contiguous: bool, 2025-05-07T20:32:14.6768181Z compiled: bool, 2025-05-07T20:32:14.6768259Z ) -> None: 2025-05-07T20:32:14.6768355Z torch.manual_seed(2025) 2025-05-07T20:32:14.6768430Z 2025-05-07T20:32:14.6768600Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6768671Z 2025-05-07T20:32:14.6768765Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6768886Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6768976Z x = x_sign * x_clamp 2025-05-07T20:32:14.6769054Z x0 = x[:, :D] 2025-05-07T20:32:14.6769129Z x1 = x[:, D:] 2025-05-07T20:32:14.6769208Z 2025-05-07T20:32:14.6769287Z if contiguous: 2025-05-07T20:32:14.6769374Z x0 = x0.contiguous() 2025-05-07T20:32:14.6769466Z x1 = x1.contiguous() 2025-05-07T20:32:14.6769537Z 2025-05-07T20:32:14.6769712Z if scale_ub is not None: 2025-05-07T20:32:14.6769818Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6769952Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6770033Z ) 2025-05-07T20:32:14.6770109Z else: 2025-05-07T20:32:14.6770210Z scale_ub_tensor = None 2025-05-07T20:32:14.6770293Z 2025-05-07T20:32:14.6770446Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6770536Z op = silu_mul_quant 2025-05-07T20:32:14.6770622Z if compiled: 2025-05-07T20:32:14.6770720Z op = torch.compile(op) 2025-05-07T20:32:14.6770824Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6770899Z 2025-05-07T20:32:14.6770987Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6770992Z 2025-05-07T20:32:14.6771085Z moe/activation_test.py:117: 2025-05-07T20:32:14.6771225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6771327Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6771425Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6771970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6772069Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6772454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6772688Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6773051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6773144Z kernel = self.compile( 2025-05-07T20:32:14.6773553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6773738Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6773874Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6773879Z 2025-05-07T20:32:14.6774169Z self = 2025-05-07T20:32:14.6775016Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6775559Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cefadf820>} 2025-05-07T20:32:14.6776375Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6776575Z context = 2025-05-07T20:32:14.6776579Z 2025-05-07T20:32:14.6776749Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6777032Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6777139Z module_map=module_map) 2025-05-07T20:32:14.6777303Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6777402Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6777480Z E ^ 2025-05-07T20:32:14.6777860Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
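Every draw fails at the same place: Triton's make_ir rejects the fp8e4nv (E4M3) element type while building IR for _fbgemm_silu_mul_quant. fp8e4nv is only lowered on NVIDIA parts with native FP8 support (compute capability 8.9 and newer, i.e. Ada/Hopper); older architectures such as sm_80/sm_86 expose only fp8e4b15 and fp8e5, which is exactly what the ValueError enumerates. Below is a minimal sketch of the kind of capability guard that would skip these draws on unsupported hardware; supports_fp8e4nv is a hypothetical helper, not part of activation_test.py:

    import torch


    def supports_fp8e4nv() -> bool:
        """True if the current CUDA device can lower Triton's fp8e4nv (E4M3) type."""
        if not torch.cuda.is_available():
            return False
        # Native E4M3 needs SM 8.9+ (Ada/Hopper). Earlier parts (sm_80, sm_86, ...)
        # only get fp8e4b15/fp8e5, the two dtypes the ValueError above lists.
        return torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical usage on the failing test:
    #   @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    #   def test_silu_mul_quant(self, ...): ...
    if __name__ == "__main__":
        print("fp8e4nv supported:", supports_fp8e4nv())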
Hypothesis went on to draw the remaining parameter combinations, and every one failed with this same CompilationError; the traceback is identical each time, running from moe/activation_test.py:117 through silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 into triton/compiler/compiler.py:100 (plus torch/_dynamo/eval_frame.py:678 when compiled=True):

Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)
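For reference, the failure does not depend on the FBGEMM kernel itself: any @triton.jit kernel that materializes tl.float8e4nv compiles into the same error on such a device. A hypothetical, self-contained reproducer (kernel and variable names are illustrative, not from the FBGEMM sources):

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _fp8e4nv_roundtrip(x_ptr, y_ptr, N: tl.constexpr):
        offs = tl.arange(0, N)
        x = tl.load(x_ptr + offs)
        # This cast is what src.make_ir rejects on pre-SM-8.9 GPUs, producing
        # ValueError("type fp8e4nv not supported in this architecture. ...").
        y = x.to(tl.float8e4nv).to(tl.float32)
        tl.store(y_ptr + offs, y)


    x = torch.randn(16, device="cuda", dtype=torch.float32)
    y = torch.empty_like(x)
    # On SM 8.9+ this runs; on older GPUs it raises
    # triton.compiler.errors.CompilationError wrapping the ValueError above.
    _fp8e4nv_roundtrip[(1,)](x, y, N=16)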
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.6917140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.6917255Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.6930662Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.6946971Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.6960610Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True): same CompilationError (fp8e4nv not supported)
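Every generated example fails identically: Triton's fp8e4nv type (its name for the NVIDIA float8 e4m3 format, torch.float8_e4m3fn in PyTorch) is only supported on GPUs with compute capability 8.9 or newer (Ada/Hopper), and this job's linux.g5.4xlarge runner carries an A10G, an SM 8.6 part on which Triton exposes only fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal probe for this condition, sketched below with a hypothetical helper name (not part of the FBGEMM tests), checks the device capability before any fp8e4nv kernel is launched:

    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv (float8 e4m3) requires SM >= 8.9; the A10G on this
        # runner reports (8, 6), so this returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)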
2025-05-07T20:32:14.6974177Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.6988105Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.7001692Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.7015110Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError (fp8e4nv not supported)
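If the goal is for the suite to pass on SM 8.6 runners rather than exercise fp8e4nv there, a probe like the one above could gate the test through unittest's skip machinery; this is a sketch only, and the class name is a stand-in, since the real test lives in moe/activation_test.py:

    import unittest

    class ActivationTest(unittest.TestCase):  # hypothetical stand-in
        @unittest.skipIf(
            not fp8e4nv_supported(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)"
        )
        def test_silu_mul_quant(self) -> None:
            ...  # body as shown in the traceback above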
2025-05-07T20:32:14.7028631Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.7042272Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.7055241Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): same CompilationError (fp8e4nv not supported)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.7071094Z 2025-05-07T20:32:14.7071537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.7071542Z 2025-05-07T20:32:14.7071645Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.7071875Z self=, 2025-05-07T20:32:14.7071957Z T=4096, 2025-05-07T20:32:14.7072035Z D=7168, 2025-05-07T20:32:14.7072116Z scale_ub=1200.0, 2025-05-07T20:32:14.7072202Z contiguous=False, 2025-05-07T20:32:14.7072285Z compiled=True, 2025-05-07T20:32:14.7072357Z ) 2025-05-07T20:32:14.7072587Z self = 2025-05-07T20:32:14.7072850Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.7072855Z 2025-05-07T20:32:14.7072927Z @given( 2025-05-07T20:32:14.7073054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.7073152Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.7073266Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.7073384Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.7073496Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.7073574Z ) 2025-05-07T20:32:14.7073833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.7073926Z def test_silu_mul_quant( 2025-05-07T20:32:14.7074005Z self, 2025-05-07T20:32:14.7074080Z T: int, 2025-05-07T20:32:14.7074155Z D: int, 2025-05-07T20:32:14.7074256Z scale_ub: Optional[float], 2025-05-07T20:32:14.7074344Z contiguous: bool, 2025-05-07T20:32:14.7074429Z compiled: bool, 2025-05-07T20:32:14.7074511Z ) -> None: 2025-05-07T20:32:14.7074605Z torch.manual_seed(2025) 2025-05-07T20:32:14.7074677Z 2025-05-07T20:32:14.7074854Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.7074928Z 2025-05-07T20:32:14.7075018Z x_sign = torch.sign(x) 2025-05-07T20:32:14.7075148Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.7075239Z x = x_sign * x_clamp 2025-05-07T20:32:14.7075319Z x0 = x[:, :D] 2025-05-07T20:32:14.7075395Z x1 = x[:, D:] 2025-05-07T20:32:14.7075468Z 2025-05-07T20:32:14.7075552Z if contiguous: 2025-05-07T20:32:14.7075641Z x0 = x0.contiguous() 2025-05-07T20:32:14.7075728Z x1 = x1.contiguous() 2025-05-07T20:32:14.7075803Z 2025-05-07T20:32:14.7075895Z if scale_ub is not None: 2025-05-07T20:32:14.7075998Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.7076141Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.7076216Z ) 2025-05-07T20:32:14.7076290Z else: 2025-05-07T20:32:14.7076385Z scale_ub_tensor = None 2025-05-07T20:32:14.7076545Z 2025-05-07T20:32:14.7076676Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.7076768Z op = silu_mul_quant 2025-05-07T20:32:14.7076851Z if compiled: 2025-05-07T20:32:14.7076952Z op = torch.compile(op) 2025-05-07T20:32:14.7077057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7077127Z 2025-05-07T20:32:14.7077220Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.7077225Z 2025-05-07T20:32:14.7077319Z moe/activation_test.py:117: 2025-05-07T20:32:14.7077450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7077552Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.7077648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7078046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.7078137Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.7078675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.7078774Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.7079157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.7079387Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.7079752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.7079843Z kernel = self.compile( 2025-05-07T20:32:14.7080252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.7080513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.7080642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7080647Z 2025-05-07T20:32:14.7080863Z self = 2025-05-07T20:32:14.7081710Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.7082259Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef41eee0>} 2025-05-07T20:32:14.7083393Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.7083603Z context = 2025-05-07T20:32:14.7083613Z 2025-05-07T20:32:14.7083788Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.7084067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.7084178Z module_map=module_map) 2025-05-07T20:32:14.7084342Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.7084443Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.7084526Z E ^ 2025-05-07T20:32:14.7084908Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.7084912Z 2025-05-07T20:32:14.7085364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.7085369Z 2025-05-07T20:32:14.7085479Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.7085712Z self=, 2025-05-07T20:32:14.7085798Z T=128, 2025-05-07T20:32:14.7086026Z D=7168, 2025-05-07T20:32:14.7086110Z scale_ub=1200.0, 2025-05-07T20:32:14.7086197Z contiguous=False, 2025-05-07T20:32:14.7086278Z compiled=True, 2025-05-07T20:32:14.7086349Z ) 2025-05-07T20:32:14.7086576Z self = 2025-05-07T20:32:14.7086752Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.7086757Z 2025-05-07T20:32:14.7086836Z @given( 2025-05-07T20:32:14.7086952Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.7087049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.7087166Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.7087279Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.7087392Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.7087470Z ) 2025-05-07T20:32:14.7087726Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.7087823Z def test_silu_mul_quant( 2025-05-07T20:32:14.7087901Z self, 2025-05-07T20:32:14.7087979Z T: int, 2025-05-07T20:32:14.7088059Z D: int, 2025-05-07T20:32:14.7088156Z scale_ub: Optional[float], 2025-05-07T20:32:14.7088245Z contiguous: bool, 2025-05-07T20:32:14.7088335Z compiled: bool, 2025-05-07T20:32:14.7088414Z ) -> None: 2025-05-07T20:32:14.7088508Z torch.manual_seed(2025) 2025-05-07T20:32:14.7088584Z 2025-05-07T20:32:14.7088754Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.7088826Z 2025-05-07T20:32:14.7088918Z x_sign = torch.sign(x) 2025-05-07T20:32:14.7089044Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.7089280Z x = x_sign * x_clamp 2025-05-07T20:32:14.7089362Z x0 = x[:, :D] 2025-05-07T20:32:14.7089440Z x1 = x[:, D:] 2025-05-07T20:32:14.7089516Z 2025-05-07T20:32:14.7089595Z if contiguous: 2025-05-07T20:32:14.7089689Z x0 = x0.contiguous() 2025-05-07T20:32:14.7089778Z x1 = x1.contiguous() 2025-05-07T20:32:14.7089850Z 2025-05-07T20:32:14.7089941Z if scale_ub is not None: 2025-05-07T20:32:14.7090047Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.7090181Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.7090252Z ) 2025-05-07T20:32:14.7090331Z else: 2025-05-07T20:32:14.7090423Z scale_ub_tensor = None 2025-05-07T20:32:14.7090497Z 2025-05-07T20:32:14.7090627Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.7090718Z op = silu_mul_quant 2025-05-07T20:32:14.7090803Z if compiled: 2025-05-07T20:32:14.7090904Z op = torch.compile(op) 2025-05-07T20:32:14.7091013Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7091088Z 2025-05-07T20:32:14.7091175Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.7091179Z 2025-05-07T20:32:14.7091280Z moe/activation_test.py:117: 2025-05-07T20:32:14.7091413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7091511Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.7091607Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7092002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.7092093Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.7092634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.7092731Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.7093109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.7093348Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.7093790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.7093885Z kernel = self.compile( 2025-05-07T20:32:14.7094294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.7094471Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.7094602Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7094606Z 2025-05-07T20:32:14.7094817Z self = 2025-05-07T20:32:14.7095663Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.7096220Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef3c9af0>} 2025-05-07T20:32:14.7097030Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.7097229Z context = 2025-05-07T20:32:14.7097233Z 2025-05-07T20:32:14.7097400Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.7097675Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.7097782Z module_map=module_map) 2025-05-07T20:32:14.7098023Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.7098127Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.7098199Z E ^ 2025-05-07T20:32:14.7098581Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.7098586Z 2025-05-07T20:32:14.7099033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.7099037Z 2025-05-07T20:32:14.7099138Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.7099370Z self=, 2025-05-07T20:32:14.7099447Z T=2048, 2025-05-07T20:32:14.7099520Z D=7168, 2025-05-07T20:32:14.7099602Z scale_ub=None, 2025-05-07T20:32:14.7099685Z contiguous=True, 2025-05-07T20:32:14.7099766Z compiled=True, 2025-05-07T20:32:14.7099840Z ) 2025-05-07T20:32:14.7100063Z self = 2025-05-07T20:32:14.7100242Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.7100249Z 2025-05-07T20:32:14.7100325Z @given( 2025-05-07T20:32:14.7100447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.7100547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.7100660Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.7100775Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.7100890Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.7100962Z ) 2025-05-07T20:32:14.7101217Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.7101314Z def test_silu_mul_quant( 2025-05-07T20:32:14.7101390Z self, 2025-05-07T20:32:14.7101468Z T: int, 2025-05-07T20:32:14.7101546Z D: int, 2025-05-07T20:32:14.7101642Z scale_ub: Optional[float], 2025-05-07T20:32:14.7101737Z contiguous: bool, 2025-05-07T20:32:14.7101820Z compiled: bool, 2025-05-07T20:32:14.7101896Z ) -> None: 2025-05-07T20:32:14.7101990Z torch.manual_seed(2025) 2025-05-07T20:32:14.7102062Z 2025-05-07T20:32:14.7102314Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.7102389Z 2025-05-07T20:32:14.7102477Z x_sign = torch.sign(x) 2025-05-07T20:32:14.7102600Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.7102693Z x = x_sign * x_clamp 2025-05-07T20:32:14.7102770Z x0 = x[:, :D] 2025-05-07T20:32:14.7102850Z x1 = x[:, D:] 2025-05-07T20:32:14.7102924Z 2025-05-07T20:32:14.7103005Z if contiguous: 2025-05-07T20:32:14.7103092Z x0 = x0.contiguous() 2025-05-07T20:32:14.7103187Z x1 = x1.contiguous() 2025-05-07T20:32:14.7103261Z 2025-05-07T20:32:14.7103356Z if scale_ub is not None: 2025-05-07T20:32:14.7103460Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.7103597Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.7103674Z ) 2025-05-07T20:32:14.7103747Z else: 2025-05-07T20:32:14.7103844Z scale_ub_tensor = None 2025-05-07T20:32:14.7103919Z 2025-05-07T20:32:14.7104048Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.7104138Z op = silu_mul_quant 2025-05-07T20:32:14.7104226Z if compiled: 2025-05-07T20:32:14.7104326Z op = torch.compile(op) 2025-05-07T20:32:14.7104429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7104506Z 2025-05-07T20:32:14.7104596Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.7104601Z 2025-05-07T20:32:14.7104697Z moe/activation_test.py:117: 2025-05-07T20:32:14.7104826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7104925Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.7105111Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7105500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.7105597Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.7106134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.7106229Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.7106610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.7106839Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.7107199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.7107299Z kernel = self.compile( 2025-05-07T20:32:14.7107705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.7107889Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.7108022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7108027Z 2025-05-07T20:32:14.7108239Z self = 2025-05-07T20:32:14.7109088Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.7109634Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef4c68b0>} 2025-05-07T20:32:14.7110544Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.7110747Z context = 2025-05-07T20:32:14.7110752Z 2025-05-07T20:32:14.7111001Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.7111281Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.7111387Z module_map=module_map) 2025-05-07T20:32:14.7111552Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.7111648Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.7111726Z E ^ 2025-05-07T20:32:14.7112105Z E ValueError("type fp8e4nv not supported in this architecture. 
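Every CompilationError above bottoms out in the same place: Triton refuses to lower the fp8e4nv element type (PyTorch's float8_e4m3fn) on this GPU, offering only fp8e4b15 and fp8e5. That reads as an architecture limitation rather than a bad test input; fp8e4nv lowering is, to my understanding, only available from compute capability (8, 9) upward, and the 22.07 GiB device in the OOM messages below is consistent with an older sm_86-class part. A minimal sketch of a capability gate that would skip these examples instead of failing them; supports_fp8e4nv, the (8, 9) threshold, and Fp8SiluMulQuantTests are my assumptions, not FBGEMM code:

# A minimal sketch, not FBGEMM code: gate FP8 tests on device capability so
# unsupported GPUs skip instead of failing with a Triton CompilationError.
# Assumption: Triton's fp8e4nv (float8_e4m3fn) lowering requires compute
# capability >= (8, 9); the GPU in this log appears to be below that.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """Hypothetical helper: True only on GPUs where fp8e4nv can compile."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class Fp8SiluMulQuantTests(unittest.TestCase):
    """Hypothetical wrapper; the real suite lives in moe/activation_test.py."""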
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

[same @given/@settings decorators and test body as above]
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

The same OutOfMemoryError (identical advice text, differing only in sizes) hits the test's setup code for:
  T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True   -> 112.00 MiB at activation_test.py:95 (x_clamp)
  T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=False  -> 448.00 MiB at activation_test.py:92 (torch.randn)
  T=2048,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True   -> 56.00 MiB at activation_test.py:95 (x_clamp)
  T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=False  -> 56.00 MiB at activation_test.py:94 (x_sign)
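The OutOfMemoryErrors look secondary: each example allocates fresh [T, 2*D] bf16 tensors (448 MiB alone for T=16384, D=7168), and once earlier examples have filled the caching allocator, later ones fail on requests as small as 56 MiB. The error text itself suggests a mitigation; below is a minimal sketch combining it with an explicit cache flush between examples. The setdefault call must run before torch initializes CUDA, and release_cuda_memory is a hypothetical helper name:

# A minimal sketch of the mitigation the error message itself suggests,
# plus an explicit cache flush between Hypothesis examples. Assumption:
# this module is imported before torch initializes CUDA (e.g. conftest.py).
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch


def release_cuda_memory() -> None:
    # Hypothetical helper to call from the test's setUp()/tearDown():
    # wait for pending kernels, then return cached blocks to the driver.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()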
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117 -> moe/activation_test.py:115: in fn -> activation.py:80: in silu_mul_quant, ending in the same
E       triton.compiler.errors.CompilationError: at 1:0:
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Likewise for:
  T=128, D=5120, scale_ub=None, contiguous=True, compiled=False
  T=128, D=7168, scale_ub=None, contiguous=True, compiled=False
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.7179747Z 2025-05-07T20:32:14.7180202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.7180206Z 2025-05-07T20:32:14.7180309Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.7180541Z self=, 2025-05-07T20:32:14.7180616Z T=2048, 2025-05-07T20:32:14.7180695Z D=7168, 2025-05-07T20:32:14.7180781Z scale_ub=1200.0, 2025-05-07T20:32:14.7180868Z contiguous=True, 2025-05-07T20:32:14.7180952Z compiled=False, 2025-05-07T20:32:14.7181029Z ) 2025-05-07T20:32:14.7181253Z self = 2025-05-07T20:32:14.7181430Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:14.7181435Z 2025-05-07T20:32:14.7181595Z @given( 2025-05-07T20:32:14.7181714Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.7181812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.7181931Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.7182045Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.7182159Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.7182232Z ) 2025-05-07T20:32:14.7182487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.7182582Z def test_silu_mul_quant( 2025-05-07T20:32:14.7182656Z self, 2025-05-07T20:32:14.7182732Z T: int, 2025-05-07T20:32:14.7183147Z D: int, 2025-05-07T20:32:14.7183248Z scale_ub: Optional[float], 2025-05-07T20:32:14.7183341Z contiguous: bool, 2025-05-07T20:32:14.7183425Z compiled: bool, 2025-05-07T20:32:14.7183504Z ) -> None: 2025-05-07T20:32:14.7183599Z torch.manual_seed(2025) 2025-05-07T20:32:14.7183674Z 2025-05-07T20:32:14.7183845Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.7185816Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.7185822Z 2025-05-07T20:32:14.7185938Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.7185943Z 2025-05-07T20:32:14.7186046Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.7186275Z self=, 2025-05-07T20:32:14.7186353Z T=1, 2025-05-07T20:32:14.7186431Z D=5120, 2025-05-07T20:32:14.7186513Z scale_ub=1200.0, 2025-05-07T20:32:14.7186596Z contiguous=True, 2025-05-07T20:32:14.7186855Z compiled=False, 2025-05-07T20:32:14.7186937Z ) 2025-05-07T20:32:14.7187162Z self = 2025-05-07T20:32:14.7187329Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:14.7187334Z 2025-05-07T20:32:14.7187410Z @given( 2025-05-07T20:32:14.7187530Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.7187632Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.7191052Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.7191193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.7191311Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.7191386Z ) 2025-05-07T20:32:14.7191655Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.7191754Z def test_silu_mul_quant( 2025-05-07T20:32:14.7191833Z self, 2025-05-07T20:32:14.7191914Z T: int, 2025-05-07T20:32:14.7191990Z D: int, 2025-05-07T20:32:14.7192088Z scale_ub: Optional[float], 2025-05-07T20:32:14.7192178Z contiguous: bool, 2025-05-07T20:32:14.7192266Z compiled: bool, 2025-05-07T20:32:14.7192347Z ) -> None: 2025-05-07T20:32:14.7192444Z torch.manual_seed(2025) 2025-05-07T20:32:14.7192521Z 2025-05-07T20:32:14.7192696Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.7192771Z 2025-05-07T20:32:14.7192863Z x_sign = torch.sign(x) 2025-05-07T20:32:14.7192988Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.7193081Z x = x_sign * x_clamp 2025-05-07T20:32:14.7193165Z x0 = x[:, :D] 2025-05-07T20:32:14.7193248Z x1 = x[:, D:] 2025-05-07T20:32:14.7193487Z 2025-05-07T20:32:14.7193570Z if contiguous: 2025-05-07T20:32:14.7193659Z x0 = x0.contiguous() 2025-05-07T20:32:14.7193751Z x1 = x1.contiguous() 2025-05-07T20:32:14.7193830Z 2025-05-07T20:32:14.7193922Z if scale_ub is not None: 2025-05-07T20:32:14.7194032Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.7194167Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.7194250Z ) 2025-05-07T20:32:14.7194327Z else: 2025-05-07T20:32:14.7194423Z scale_ub_tensor = None 2025-05-07T20:32:14.7194502Z 2025-05-07T20:32:14.7194636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.7194729Z op = silu_mul_quant 2025-05-07T20:32:14.7194820Z if compiled: 2025-05-07T20:32:14.7194919Z op = torch.compile(op) 2025-05-07T20:32:14.7195024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7195108Z 2025-05-07T20:32:14.7195198Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.7195203Z 2025-05-07T20:32:14.7195301Z moe/activation_test.py:117: 2025-05-07T20:32:14.7195439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7195538Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.7195644Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7196185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.7196285Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.7196668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.7196905Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.7197263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.7197364Z kernel = self.compile( 2025-05-07T20:32:14.7197774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.7198037Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.7198167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7198172Z 2025-05-07T20:32:14.7198380Z self = 2025-05-07T20:32:14.7199234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.7199776Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef3089d0>} 2025-05-07T20:32:14.7200603Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.7200798Z context = 2025-05-07T20:32:14.7200803Z 2025-05-07T20:32:14.7200973Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.7201248Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.7201355Z module_map=module_map) 2025-05-07T20:32:14.7201522Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.7201617Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.7201693Z E ^ 2025-05-07T20:32:14.7202079Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.7202164Z 2025-05-07T20:32:14.7202609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.7202614Z 2025-05-07T20:32:14.7202725Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.7202955Z self=, 2025-05-07T20:32:14.7203029Z T=2048, 2025-05-07T20:32:14.7203107Z D=5120, 2025-05-07T20:32:14.7203187Z scale_ub=None, 2025-05-07T20:32:14.7203271Z contiguous=True, 2025-05-07T20:32:14.7203358Z compiled=False, 2025-05-07T20:32:14.7203429Z ) 2025-05-07T20:32:14.7203655Z self = 2025-05-07T20:32:14.7203833Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:14.7203838Z 2025-05-07T20:32:14.7203914Z @given( 2025-05-07T20:32:14.7204037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.7204138Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.7204254Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.7204373Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.7204488Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.7204562Z ) 2025-05-07T20:32:14.7204824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.7204917Z def test_silu_mul_quant( 2025-05-07T20:32:14.7204990Z self, 2025-05-07T20:32:14.7205070Z T: int, 2025-05-07T20:32:14.7205145Z D: int, 2025-05-07T20:32:14.7205241Z scale_ub: Optional[float], 2025-05-07T20:32:14.7205333Z contiguous: bool, 2025-05-07T20:32:14.7205418Z compiled: bool, 2025-05-07T20:32:14.7205502Z ) -> None: 2025-05-07T20:32:14.7205595Z torch.manual_seed(2025) 2025-05-07T20:32:14.7205669Z 2025-05-07T20:32:14.7205842Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.7205920Z 2025-05-07T20:32:14.7206011Z > x_sign = torch.sign(x) 2025-05-07T20:32:14.7208055Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
Hypothesis then tries further examples, and every large one fails with CUDA OOM before reaching the kernel. The first such failure in full:

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:94: OutOfMemoryError

[The next ten examples fail identically while allocating x (moe/activation_test.py:92), with the same allocator state as above; only the parameters and the requested size differ:]

Trying example: T=16384, D=5120, scale_ub=None,   contiguous=True,  compiled=False -> tried to allocate 320.00 MiB
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=True,  compiled=False -> tried to allocate 80.00 MiB
Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=False -> tried to allocate 40.00 MiB
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=True  -> tried to allocate 112.00 MiB
Trying example: T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False -> tried to allocate 40.00 MiB
Trying example: T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> tried to allocate 112.00 MiB
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=True  -> tried to allocate 448.00 MiB
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=False -> tried to allocate 112.00 MiB
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=False -> tried to allocate 448.00 MiB
Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> tried to allocate 448.00 MiB
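Note: each large example dies allocating the input tensor itself, and the failed sizes match the tensor shapes exactly: at T=16384, D=7168 the bf16 input x is 16384 * (2 * 7168) * 2 bytes = 448 MiB, the size reported above. The GPU is already at 22.03 GiB of 22.07 GiB before each attempt, so memory from earlier examples is not being returned between Hypothesis examples. The allocator's own suggestion can be applied via the environment, and flushing the cache between examples is a plausible mitigation (a sketch only, assuming it runs before CUDA is initialized; this is not what the test currently does):

    import os
    # Honoured only if set before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cached_blocks() -> None:
        # Hypothetical per-example cleanup: returns cached blocks to the driver
        # so the next Hypothesis example starts from a smaller footprint.
        torch.cuda.synchronize()
        torch.cuda.empty_cache()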
The first example small enough to allocate (T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) gets past the input setup and hits the same Triton failure at the kernel launch:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[... same jit.py / compiler.py frames as the first CompilationError above ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92 while allocating 56.00 MiB (30.44 MiB free; 21.74 GiB allocated by PyTorch)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

With compiled=True the call is routed through torch.compile before reaching the same kernel, and fails the same way:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[... same jit.py / compiler.py frames as above ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
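Note: the error repeats for every example that survives allocation, with or without torch.compile, because it is raised while Triton lowers the kernel's AST to TTIR, before anything is launched. A standalone sketch of the failing pattern (hypothetical kernel, not the FBGEMM source) that reproduces the same ValueError on a pre-SM-8.9 GPU:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _fp8_cast_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
        offs = tl.program_id(axis=0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask)
        # Casting to the fp8e4nv element type is what trips
        # ValueError("type fp8e4nv not supported in this architecture.") here.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _fp8_cast_kernel[(1,)](x, y, x.numel(), BLOCK=1024)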
[Three further T=128 examples then fail with torch.OutOfMemoryError on 20.00 MiB allocations; free memory has now dropped to 8.44 MiB with 22.05 GiB in use:]

Trying example: T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False -> OOM at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0))
Trying example: T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True  -> OOM at moe/activation_test.py:94 (x_sign = torch.sign(x))
Trying example: T=128, D=7168, scale_ub=None,   contiguous=True, compiled=True  -> OOM at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...))

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 (3 occurrences)
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "

experimental/gen_ai/test/moe/activation_test.py: 10 warnings
  /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844.
    torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 1 failed, 1 passed, 13 warnings in 33.02s ===================
ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error)

[TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py

[EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py
============================= test session starts ==============================
platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
cachedir: .pytest_cache
hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
plugins: hypothesis-6.131.14
TMA benchmarks will be running with experimental grid constant TMA descriptor.
collecting ... collected 2 items / 1 deselected / 1 selected
run-last-failure: rerun previous 1 failure

W0507 20:32:22.740729 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[the "W0507 20:32:22.740729 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]" prefix on each line is elided below]
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
    return visitor(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
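Note: these W0507 lines come from torch.compile's handling of user-defined Triton kernels. To decide which kernel arguments are mutated, Dynamo compiles the kernel to TTIR on the side; since that compilation hits the same fp8e4nv ValueError, it falls back to assuming every input is mutated, which is correctness-preserving but pessimistic, and the warning itself is informational. One way to keep such a wrapper out of Dynamo's analysis entirely is to exclude it from compilation (a sketch with hypothetical placement; FBGEMM may handle this differently):

    import torch

    @torch.compiler.disable
    def silu_mul_quant_uncompiled(x0, x1, scale_ub_tensor):
        # Runs eagerly even inside a torch.compile'd region, so Dynamo never
        # tries to trace or analyze the Triton kernel launch it contains.
        ...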
[An identical identify_mutated_tensors warning and CompilationError traceback is emitted a second time at 20:32:22.758873 for the second kernel launch.]

moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.4159714Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:23.4160044Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:23.4160898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:23.4161721Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:23.4162303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.4163035Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.4163774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:23.4164546Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:23.4165356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:23.4166156Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:23.4166938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:23.4167624Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:23.4168261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:23.4168912Z fn() 2025-05-07T20:32:23.4169449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:23.4170071Z self.fn.run( 2025-05-07T20:32:23.4170551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.4171115Z kernel = self.compile( 2025-05-07T20:32:23.4171687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.4172374Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.4172788Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.4173043Z 2025-05-07T20:32:23.4173256Z self = 2025-05-07T20:32:23.4174431Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.4176016Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd80e7040>} 2025-05-07T20:32:23.4177489Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.4178591Z context = 2025-05-07T20:32:23.4178902Z 2025-05-07T20:32:23.4179075Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.4179706Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.4180201Z module_map=module_map) 2025-05-07T20:32:23.4180572Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.4180939Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:23.4181210Z E ^ 2025-05-07T20:32:23.4181693Z E ValueError("type fp8e4nv not supported in this architecture. 
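Every failure in this run reduces to the same root cause: the Triton kernels request the fp8e4nv (e4m3) element type, which Triton compiles only for NVIDIA GPUs of compute capability 8.9 or newer, while the A10G on this linux.g5.4xlarge runner is SM 8.6 (hence the hint that only 'fp8e4b15' and 'fp8e5' are available). A minimal sketch of a capability gate such a test module could apply; supports_fp8e4nv and Fp8e4nvTests are illustrative names, not part of the suite above:

# Hypothetical skip guard. Assumption: fp8e4nv Triton codegen requires
# compute capability >= 8.9 (Ada/Hopper); the A10G here reports (8, 6).
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """Return True when the visible GPU can compile fp8e4nv Triton kernels."""
    if not torch.cuda.is_available():
        return False
    # get_device_capability() returns a (major, minor) tuple.
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs SM 8.9+; skipping")
class Fp8e4nvTests(unittest.TestCase):
    def test_placeholder(self) -> None:
        self.assertTrue(supports_fp8e4nv())

With a gate like this the suite would report the fp8 cases as skipped on SM 8.6 runners instead of burning time on repeated CompilationErrors.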
Trying example: test_silu_mul_quant(
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
W0507 20:32:24.443299 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
    return visitor(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
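For reference while reading these listings, the fused op under test computes silu(x0) * x1 and then quantizes the product rowwise to fp8. A pure-eager sketch of that math follows; quantize_fp8_row_ref and silu_mul_quant_ref are illustrative stand-ins (not FBGEMM's API), and e5m2 is chosen only because this GPU supports it:

# Eager sketch of the computation the failing kernels fuse: SiLU-mul followed
# by rowwise fp8 quantization. The scale convention (dequant multiplies by
# scale) mirrors the test's `y_fp8.to(torch.float32) * y_scale[:, None]`.
from typing import Optional, Tuple

import torch


def quantize_fp8_row_ref(
    y: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
    fp8_dtype: torch.dtype = torch.float8_e5m2,  # available pre-SM 8.9
) -> Tuple[torch.Tensor, torch.Tensor]:
    fp8_max = torch.finfo(fp8_dtype).max
    row_max = y.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        # Clamp each row's dynamic range by the optional upper bound.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    scale = row_max / fp8_max
    scale = torch.where(scale > 0, scale, torch.ones_like(scale))  # guard all-zero rows
    y_fp8 = (y.to(torch.float32) / scale[:, None]).to(fp8_dtype)
    return y_fp8, scale


def silu_mul_quant_ref(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    x0_fp32, x1_fp32 = x0.to(torch.float32), x1.to(torch.float32)
    return quantize_fp8_row_ref(x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32, scale_ub)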
Trying example: test_silu_mul_quant(
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
W0507 20:32:26.182805 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
    return visitor(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
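A note on the warning above: the [0/0] through [0/3] prefixes are Dynamo (re)compile attempts, one per new (T, D) example, and "assuming every input is mutated" is torch.compile's safe fallback when it cannot lower the user Triton kernel to TTIR to analyze which arguments are written. If the op tolerates dynamic shapes (an assumption; the import path below is inferred from the traceback), marking the compile dynamic is one way to collapse the per-shape recompiles, sketched here:

# Sketch only: compile once with symbolic shapes instead of once per (T, D).
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# dynamic=True asks Dynamo to trace with symbolic sizes, so new token counts
# reuse the cached graph rather than advancing the [0/N] restart counter.
compiled_op = torch.compile(silu_mul_quant, dynamic=True)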
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
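Hypothesis keeps drawing fresh examples even though every draw fails identically. When triaging a log like this, it can help to pin one failing draw so it always runs first, deterministically. A small self-contained sketch using Hypothesis's @example decorator; test_pinned_example is illustrative, not part of this suite:

# Sketch: pin a known-failing draw ahead of Hypothesis's random search.
# The parameters mirror the first failing example in this log.
from hypothesis import example, given, settings
import hypothesis.strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@example(T=1, D=5120)  # always tried before any random draws
@settings(deadline=None)
def test_pinned_example(T: int, D: int) -> None:
    assert T > 0 and D > 0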
Trying example: test_silu_mul_quant(
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
W0507 20:32:28.176104 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
    return visitor(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None
2025-05-07T20:32:30.1614761Z 2025-05-07T20:32:30.1614993Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.1615327Z op = silu_mul_quant 2025-05-07T20:32:30.1615754Z if compiled: 2025-05-07T20:32:30.1616007Z op = torch.compile(op) 2025-05-07T20:32:30.1616315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.1616611Z 2025-05-07T20:32:30.1616805Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.1616984Z 2025-05-07T20:32:30.1617085Z moe/activation_test.py:117: 2025-05-07T20:32:30.1617396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1617745Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.1618028Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.1618770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.1619517Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.1620083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.1620827Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.1621563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.1622137Z kernel = self.compile( 2025-05-07T20:32:30.1622714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.1623412Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.1623834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1624077Z 2025-05-07T20:32:30.1624298Z self = 2025-05-07T20:32:30.1625470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.1627064Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd6c120d0>} 2025-05-07T20:32:30.1628536Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.1629742Z context = 2025-05-07T20:32:30.1630048Z 2025-05-07T20:32:30.1630225Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.1630771Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.1631274Z module_map=module_map) 2025-05-07T20:32:30.1631649Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.1632004Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.1632275Z E ^ 2025-05-07T20:32:30.1632765Z E ValueError("type fp8e4nv not supported in this architecture. 
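Every failure in this run reduces to the same root cause: Triton's fp8e4nv type (FP8 E4M3) is only available on NVIDIA GPUs with compute capability 8.9 or newer, and the A10G behind a linux.g5.4xlarge runner reports SM 8.6, which is why the error lists only ('fp8e4b15', 'fp8e5') as supported. A minimal sketch of a capability gate for such tests, assuming a class-level skip is acceptable; the helper name, class name, and the (8, 9) threshold are illustrative assumptions, not code from moe/activation_test.py:

    # Hypothetical capability gate: skip FP8 E4M3 tests on GPUs older than
    # SM 8.9, where Triton rejects fp8e4nv at compile time (as in this log).
    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: hardware FP8 E4M3 arrived with SM 8.9 (Ada) / 9.0 (Hopper).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class Fp8MoEActivationTests(unittest.TestCase):
        ...

With a gate like this, the whole class would be reported as skipped once instead of Hypothesis replaying the same CompilationError for every drawn example.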
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False
[... test body and _fbgemm_silu_mul_quant compile traceback identical to the T=4096, D=5120 example above; fails at moe/activation_test.py:117 with the same fp8e4nv ValueError ...]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
[... test body as above through fn() ...]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, ...)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f3dd6791700>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
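The ref_fn above computes the SiLU-mul in fp32 and delegates row-wise FP8 quantization to triton_quantize_fp8_row, which is itself a Triton kernel and therefore fails on this GPU for the same reason as the op under test. A plain-PyTorch sketch of that quantization step, consistent with the dequantization y_fp8.to(torch.float32) * y_scale[:, None] used in the test; the FP8_E4M3_MAX constant and the eps clamp are assumptions, not fbgemm_gpu internals:

    # Row-wise FP8 quantization in plain PyTorch (no Triton compile needed).
    # Assumed details: 448.0 is the max finite float8_e4m3fn value; the clamp
    # avoids division by zero on all-zero rows.
    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = (row_max / FP8_E4M3_MAX).clamp(min=1e-12)
        y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale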
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False
[... test body and traceback identical to the T=4096, D=5120 example above; _fbgemm_silu_mul_quant fails to compile at moe/activation_test.py:117 with the same fp8e4nv ValueError ...]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self = <...>
T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
[... test body and traceback identical; _fbgemm_silu_mul_quant fails to compile with the same fp8e4nv ValueError ...]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
W0507 20:32:31.177882 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... identical identify_mutated_tensors traceback as above, logged twice (20:32:31.177882 and 20:32:31.367869); _fbgemm_silu_mul_quant fails TTIR generation with the same fp8e4nv ValueError ...]

self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test body as above through fn() and ref_fn() ...]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[... triton_quantize_fp8_row -> _kernel_quantize_fp8_row autotune/compile traceback identical to the T=128, D=7168, compiled=True example above ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
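Each "Trying example: test_silu_mul_quant(...)" block is Hypothesis's Verbosity.verbose echo of the arguments it drew via st.sampled_from. When debugging a single draw such as the T=1, D=5120 case above, one illustrative option (not present in moe/activation_test.py) is to pin it with hypothesis's @example decorator, which runs the given arguments unconditionally before any random examples:

    # Sketch: pin one failing draw from this log so it replays on every run.
    # The reduced signature is illustrative; the real test takes more knobs.
    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=1, D=5120)
    @settings(deadline=None)
    def test_silu_mul_quant_sketch(T: int, D: int) -> None:
        ...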
Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
W0507 20:32:32.347668 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... identical identify_mutated_tensors traceback as above, logged twice (20:32:32.347668 and 20:32:32.535313); _fbgemm_silu_mul_quant fails TTIR generation with the same fp8e4nv ValueError ...]

self = <...>
T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test body as above through fn() and ref_fn(); ref_fn() fails at moe/activation_test.py:126 via triton_quantize_fp8_row -> _kernel_quantize_fp8_row ...]

self = <...>
options = CUDAOptions(num_warps=4,
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.0447030Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd5cd69d0>} 2025-05-07T20:32:33.0448501Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.0449607Z context = 2025-05-07T20:32:33.0449910Z 2025-05-07T20:32:33.0450087Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.0450626Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.0451197Z module_map=module_map) 2025-05-07T20:32:33.0451569Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.0451939Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:33.0452202Z E ^ 2025-05-07T20:32:33.0452688Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.0453175Z 2025-05-07T20:32:33.0453626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.0454202Z 2025-05-07T20:32:33.0454312Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.0454755Z self=, 2025-05-07T20:32:33.0455170Z T=128, 2025-05-07T20:32:33.0455354Z D=5120, 2025-05-07T20:32:33.0455539Z scale_ub=None, 2025-05-07T20:32:33.0455758Z contiguous=True, 2025-05-07T20:32:33.0455985Z compiled=True, 2025-05-07T20:32:33.0456178Z ) 2025-05-07T20:32:33.5743297Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:33.5744966Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:33.5746446Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:33.5748022Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:33.5749537Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:33.5751398Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.5752837Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:33.5754354Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.5755915Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File 
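Every failing example above bottoms out in the same ValueError: Triton's fp8e4nv type (FP8 E4M3) is not lowered for this runner's GPU, an NVIDIA A10G reporting compute capability (8, 6), while fp8e4nv generally requires sm_89 or newer. A minimal sketch of a capability gate such a test could use follows; the (8, 9) threshold and the skip decorator are illustrative assumptions, not FBGEMM's actual guard.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: Triton lowers fp8e4nv (e4m3) only on compute capability
    # >= (8, 9) (Ada/Hopper). The A10G here reports (8, 6), which matches
    # the CompilationErrors in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical gate; the real test module may guard differently.
@unittest.skipUnless(supports_fp8e4nv(), "FP8 e4m3 requires sm_89 or newer")
class SiluMulQuantFP8Test(unittest.TestCase):
    pass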
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:33.5757301Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:33.5758635Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:33.5759961Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:33.5761094Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:33.5762209Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:33.5763720Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:33.5765125Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:33.5766340Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:33.5767471Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:33.5768764Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:33.5770258Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:33.5771415Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.5772401Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.5773194Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:33.5774297Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.7628846Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:33.7630266Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:33.7636707Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:33.7638312Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:33.7639844Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:33.7641393Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.7642837Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:33.7644398Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.7645994Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:33.7647552Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:33.7648902Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:33.7650233Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:33.7651382Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:33.7652510Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:33.7653865Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:33.7655280Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:33.7656493Z W0507 
20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:33.7657634Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:33.7658940Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:33.7660439Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:33.7661679Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.7662659Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.7663458Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:33.7664572Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.5817350Z self = 2025-05-07T20:32:34.5818143Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:34.5818536Z 2025-05-07T20:32:34.5818645Z @given( 2025-05-07T20:32:34.5818974Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.5819396Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.5819790Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.5820214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.5820585Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.5820882Z ) 2025-05-07T20:32:34.5821252Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.5821724Z def test_silu_mul_quant( 2025-05-07T20:32:34.5821968Z self, 2025-05-07T20:32:34.5822157Z T: int, 2025-05-07T20:32:34.5822349Z D: int, 2025-05-07T20:32:34.5822568Z scale_ub: Optional[float], 2025-05-07T20:32:34.5822847Z contiguous: bool, 2025-05-07T20:32:34.5823298Z compiled: bool, 2025-05-07T20:32:34.5823523Z ) -> None: 2025-05-07T20:32:34.5823745Z torch.manual_seed(2025) 2025-05-07T20:32:34.5823996Z 2025-05-07T20:32:34.5824276Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.5824641Z 2025-05-07T20:32:34.5824838Z x_sign = torch.sign(x) 2025-05-07T20:32:34.5825135Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.5825467Z x = x_sign * x_clamp 2025-05-07T20:32:34.5825719Z x0 = x[:, :D] 2025-05-07T20:32:34.5825939Z x1 = x[:, D:] 2025-05-07T20:32:34.5826152Z 2025-05-07T20:32:34.5826341Z if contiguous: 2025-05-07T20:32:34.5826574Z x0 = x0.contiguous() 2025-05-07T20:32:34.5826843Z x1 = x1.contiguous() 2025-05-07T20:32:34.5827095Z 2025-05-07T20:32:34.5827288Z if scale_ub is not None: 2025-05-07T20:32:34.5827576Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.5827933Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.5828246Z ) 2025-05-07T20:32:34.5828438Z else: 2025-05-07T20:32:34.5828646Z scale_ub_tensor = None 
2025-05-07T20:32:34.5828902Z 2025-05-07T20:32:34.5829128Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.5829455Z op = silu_mul_quant 2025-05-07T20:32:34.5829829Z if compiled: 2025-05-07T20:32:34.5830079Z op = torch.compile(op) 2025-05-07T20:32:34.5830383Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.5830663Z 2025-05-07T20:32:34.5830849Z y_fp8, y_scale = fn() 2025-05-07T20:32:34.5831129Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:34.5831425Z 2025-05-07T20:32:34.5831654Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.5831998Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:34.5832303Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:34.5832622Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:34.5832992Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:34.5833438Z 2025-05-07T20:32:34.5833644Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:34.5833843Z 2025-05-07T20:32:34.5833942Z moe/activation_test.py:126: 2025-05-07T20:32:34.5834247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.5834595Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:34.5834924Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:34.5835773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:34.5836589Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:34.5837171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.5837906Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.5838651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:34.5839420Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:34.5840227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:34.5841018Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:34.5841795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:34.5842473Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:34.5843102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:34.5843738Z fn() 2025-05-07T20:32:34.5844266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:34.5844936Z self.fn.run( 2025-05-07T20:32:34.5845413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.5845971Z kernel = self.compile( 2025-05-07T20:32:34.5846539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.5847224Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.5847633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.5847881Z 2025-05-07T20:32:34.5848094Z self = 2025-05-07T20:32:34.5849258Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.5850785Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd5944d30>} 2025-05-07T20:32:34.5852246Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.5853342Z context = 2025-05-07T20:32:34.5853646Z 2025-05-07T20:32:34.5853817Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.5854361Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.5854851Z module_map=module_map) 2025-05-07T20:32:34.5855225Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.5855590Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:34.5855941Z E ^ 2025-05-07T20:32:34.5856435Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.5856925Z 2025-05-07T20:32:34.5857381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.5857932Z 2025-05-07T20:32:34.5858040Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.5858454Z self=, 2025-05-07T20:32:34.5858871Z T=4096, 2025-05-07T20:32:34.5859054Z D=5120, 2025-05-07T20:32:34.5859242Z scale_ub=None, 2025-05-07T20:32:34.5859450Z contiguous=True, 2025-05-07T20:32:34.5859671Z compiled=True, 2025-05-07T20:32:34.5859870Z ) 2025-05-07T20:32:35.1192014Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:35.1193479Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:35.1194961Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:35.1196540Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:35.1198055Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:35.1199761Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.1201197Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:35.1202715Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.1204277Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:35.1205713Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:35.1207055Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:35.1208390Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:35.1209531Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:35.1210649Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:35.1212105Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:35.1213522Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:35.1214740Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:35.1215879Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:35.1217171Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:35.1218675Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:35.1219828Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.1220815Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.1221612Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:35.1222718Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3110236Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:35.3111849Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:35.3113320Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:35.3114936Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:35.3116457Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:35.3117993Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3119432Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:35.3120943Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3122504Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:35.3123880Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:35.3125370Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:35.3126700Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:35.3127833Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:35.3128954Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:35.3130301Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:35.3131725Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:35.3132939Z W0507 
20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:35.3134077Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:35.3135374Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:35.3136955Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:35.3138112Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3139099Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3139894Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:35.3141003Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.9742627Z self = 2025-05-07T20:32:35.9744042Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.9744744Z 2025-05-07T20:32:35.9744944Z @given( 2025-05-07T20:32:35.9745383Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.9745744Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.9746061Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.9746402Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.9746742Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.9747064Z ) 2025-05-07T20:32:35.9747425Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.9747888Z def test_silu_mul_quant( 2025-05-07T20:32:35.9748137Z self, 2025-05-07T20:32:35.9748335Z T: int, 2025-05-07T20:32:35.9748526Z D: int, 2025-05-07T20:32:35.9748751Z scale_ub: Optional[float], 2025-05-07T20:32:35.9749026Z contiguous: bool, 2025-05-07T20:32:35.9749269Z compiled: bool, 2025-05-07T20:32:35.9749500Z ) -> None: 2025-05-07T20:32:35.9749860Z torch.manual_seed(2025) 2025-05-07T20:32:35.9750106Z 2025-05-07T20:32:35.9750565Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.9750928Z 2025-05-07T20:32:35.9751112Z x_sign = torch.sign(x) 2025-05-07T20:32:35.9751410Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.9751728Z x = x_sign * x_clamp 2025-05-07T20:32:35.9751971Z x0 = x[:, :D] 2025-05-07T20:32:35.9752183Z x1 = x[:, D:] 2025-05-07T20:32:35.9752386Z 2025-05-07T20:32:35.9752571Z if contiguous: 2025-05-07T20:32:35.9752798Z x0 = x0.contiguous() 2025-05-07T20:32:35.9753058Z x1 = x1.contiguous() 2025-05-07T20:32:35.9753297Z 2025-05-07T20:32:35.9753481Z if scale_ub is not None: 2025-05-07T20:32:35.9753757Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.9754107Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.9754414Z ) 2025-05-07T20:32:35.9754606Z else: 2025-05-07T20:32:35.9754815Z scale_ub_tensor = None 
2025-05-07T20:32:35.9755071Z 2025-05-07T20:32:35.9755302Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.9755619Z op = silu_mul_quant 2025-05-07T20:32:35.9755863Z if compiled: 2025-05-07T20:32:35.9756110Z op = torch.compile(op) 2025-05-07T20:32:35.9756411Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.9756700Z 2025-05-07T20:32:35.9756884Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.9757169Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.9757467Z 2025-05-07T20:32:35.9757697Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.9758038Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.9758333Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.9758780Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.9759150Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.9759474Z 2025-05-07T20:32:35.9759674Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.9759882Z 2025-05-07T20:32:35.9759980Z moe/activation_test.py:126: 2025-05-07T20:32:35.9760284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.9760630Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.9760957Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.9761803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.9762615Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.9763182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.9763917Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.9764656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.9765429Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.9766233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.9767032Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.9767807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.9768488Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.9769118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.9769671Z fn() 2025-05-07T20:32:35.9770203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.9770898Z self.fn.run( 2025-05-07T20:32:35.9771387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.9771949Z kernel = self.compile( 2025-05-07T20:32:35.9772516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.9773208Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.9773617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.9773858Z 2025-05-07T20:32:35.9774073Z self = 2025-05-07T20:32:35.9775238Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.9776755Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd5944670>} 2025-05-07T20:32:35.9778223Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.9779328Z context = 2025-05-07T20:32:35.9779631Z 2025-05-07T20:32:35.9779802Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.9780339Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.9780830Z module_map=module_map) 2025-05-07T20:32:35.9781284Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.9781648Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.9781911Z E ^ 2025-05-07T20:32:35.9782404Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.9783076Z 2025-05-07T20:32:35.9783530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.9784084Z 2025-05-07T20:32:35.9784184Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.9784607Z self=, 2025-05-07T20:32:35.9785054Z T=16384, 2025-05-07T20:32:35.9785268Z D=5120, 2025-05-07T20:32:35.9785454Z scale_ub=None, 2025-05-07T20:32:35.9785661Z contiguous=True, 2025-05-07T20:32:35.9785883Z compiled=True, 2025-05-07T20:32:35.9786079Z ) 2025-05-07T20:32:36.0205788Z W0507 20:32:36.019138 87499 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:36.0207387Z W0507 20:32:36.019138 87499 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:36.0208850Z W0507 20:32:36.019138 87499 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:36.0209920Z W0507 20:32:36.019138 87499 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:36.0211114Z W0507 20:32:36.019138 87499 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
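The [0/8] warning is a separate issue from the fp8 failures: each Hypothesis example changes T, and toggling contiguous changes the strides of the x0/x1 slices, so torch.compile's guards force a fresh graph per example until the default recompile_limit of 8 is exhausted and Dynamo falls back to eager. A hedged sketch of the usual mitigations follows; the import path for silu_mul_quant is inferred from the traceback above and may differ.

import torch

# Path inferred from the log (.../fbgemm_gpu/experimental/gen_ai/moe/activation.py).
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Option 1: raise the recompile budget (the log shows the default of 8).
torch._dynamo.config.recompile_limit = 64

# Option 2: compile with dynamic shapes so a changing T does not trigger a
# new graph for every example.
op = torch.compile(silu_mul_quant, dynamic=True)

# Option 3: mark only the varying batch dimension as dynamic.
x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
torch._dynamo.mark_dynamic(x0, 0)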
2025-05-07T20:32:36.1427696Z self = 
2025-05-07T20:32:36.1428489Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:36.1428881Z 
2025-05-07T20:32:36.1443379Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:36.1443587Z 
2025-05-07T20:32:36.1443684Z moe/activation_test.py:126: 
2025-05-07T20:32:36.1470478Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.1470863Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:36.1471141Z E       ^
2025-05-07T20:32:36.1471636Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.1472136Z 
2025-05-07T20:32:36.1472586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:36.1473147Z 
2025-05-07T20:32:36.1473251Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:36.1473683Z     self=,
2025-05-07T20:32:36.1474101Z     T=1,
2025-05-07T20:32:36.1474283Z     D=5120,
2025-05-07T20:32:36.1474482Z     scale_ub=1200.0,
2025-05-07T20:32:36.1474704Z     contiguous=True,
2025-05-07T20:32:36.1474936Z     compiled=True,
2025-05-07T20:32:36.1475174Z )
2025-05-07T20:32:36.3186048Z self = 
2025-05-07T20:32:36.3187024Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:36.3187360Z 
2025-05-07T20:32:36.3199009Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:36.3199176Z 
2025-05-07T20:32:36.3199276Z moe/activation_test.py:117: 
2025-05-07T20:32:36.3199582Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:36.3199935Z moe/activation_test.py:115: in fn
2025-05-07T20:32:36.3200218Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:36.3200815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:36.3201416Z     return fn(*args, **kwargs)
2025-05-07T20:32:36.3202124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:36.3202863Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:36.3203433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:36.3204172Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:36.3204881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:36.3205451Z     kernel = self.compile(
2025-05-07T20:32:36.3206022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:36.3206805Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:36.3207212Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:36.3207457Z 
2025-05-07T20:32:36.3207668Z self = 
2025-05-07T20:32:36.3208838Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:36.3210340Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd55e9ca0>}
2025-05-07T20:32:36.3211818Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:36.3212922Z context = 
2025-05-07T20:32:36.3213232Z 
2025-05-07T20:32:36.3213398Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:36.3213945Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:36.3214435Z                            module_map=module_map)
2025-05-07T20:32:36.3214803Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.3215162Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:36.3215423Z E       ^
2025-05-07T20:32:36.3215907Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.3216481Z 
2025-05-07T20:32:36.3216926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
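For orientation: ref_fn in the test computes SiLU(x0) * x1 in fp32 and then calls triton_quantize_fp8_row, which is itself a Triton kernel (_kernel_quantize_fp8_row) and so fails on this GPU for the same reason as the fused kernel. Below is a pure-PyTorch sketch of row-wise FP8 quantization with the same output contract the test checks (y is approximately y_fp8.float() * scale[:, None]); the 448.0 max and the scale_ub handling are assumptions about the kernel's semantics, not FBGEMM's implementation.

from typing import Optional, Tuple

import torch

E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn


def rowwise_fp8_quant_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Illustrative stand-in for triton_quantize_fp8_row, runnable in eager
    # mode on any device that supports float8 storage casts.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        # Assumption: scale_ub caps the per-row max used to derive the scale.
        row_max = torch.minimum(row_max, scale_ub)
    scale = torch.clamp(row_max, min=1e-12) / E4M3_MAX  # per-row dequant scale
    y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale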
2025-05-07T20:32:36.3217485Z 
2025-05-07T20:32:36.3217591Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:36.3218012Z     self=,
2025-05-07T20:32:36.3218423Z     T=1,
2025-05-07T20:32:36.3218604Z     D=5120,
2025-05-07T20:32:36.3218792Z     scale_ub=None,
2025-05-07T20:32:36.3218996Z     contiguous=False,
2025-05-07T20:32:36.3219216Z     compiled=True,
2025-05-07T20:32:36.3219415Z )
2025-05-07T20:32:36.4028334Z self = 
2025-05-07T20:32:36.4029077Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:36.4029459Z 
2025-05-07T20:32:36.4043998Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:36.4044324Z 
2025-05-07T20:32:36.4044428Z moe/activation_test.py:126: 
2025-05-07T20:32:36.4065728Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.4066087Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:36.4066358Z E       ^
2025-05-07T20:32:36.4066840Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.4067440Z 
2025-05-07T20:32:36.4067890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:36.4068445Z 
2025-05-07T20:32:36.4068560Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:36.4068978Z     self=,
2025-05-07T20:32:36.4069396Z     T=1,
2025-05-07T20:32:36.4069572Z     D=5120,
2025-05-07T20:32:36.4069823Z     scale_ub=None,
2025-05-07T20:32:36.4070029Z     contiguous=True,
2025-05-07T20:32:36.4070253Z     compiled=False,
2025-05-07T20:32:36.4070450Z )
2025-05-07T20:32:36.7793534Z self = 
2025-05-07T20:32:36.7794336Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:36.7794714Z 
2025-05-07T20:32:36.7794828Z     @given(
2025-05-07T20:32:36.7795147Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:36.7795511Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:36.7795821Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:36.7796163Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:36.7796492Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:36.7796781Z     )
2025-05-07T20:32:36.7797138Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:36.7797599Z     def test_silu_mul_quant(
2025-05-07T20:32:36.7797841Z         self,
2025-05-07T20:32:36.7798030Z         T: int,
2025-05-07T20:32:36.7798218Z         D: int,
2025-05-07T20:32:36.7798435Z         scale_ub: Optional[float],
2025-05-07T20:32:36.7798705Z         contiguous: bool,
2025-05-07T20:32:36.7798938Z         compiled: bool,
2025-05-07T20:32:36.7799159Z     ) -> None:
2025-05-07T20:32:36.7799367Z         torch.manual_seed(2025)
2025-05-07T20:32:36.7799603Z 
2025-05-07T20:32:36.7799877Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:36.7800230Z 
2025-05-07T20:32:36.7800411Z         x_sign = torch.sign(x)
2025-05-07T20:32:36.7800873Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:36.7801192Z         x = x_sign * x_clamp
2025-05-07T20:32:36.7801431Z         x0 = x[:, :D]
2025-05-07T20:32:36.7801642Z         x1 = x[:, D:]
2025-05-07T20:32:36.7801847Z 
2025-05-07T20:32:36.7802029Z         if contiguous:
2025-05-07T20:32:36.7802252Z             x0 = x0.contiguous()
2025-05-07T20:32:36.7802512Z             x1 = x1.contiguous()
2025-05-07T20:32:36.7802757Z 
2025-05-07T20:32:36.7802945Z         if scale_ub is not None:
2025-05-07T20:32:36.7803218Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:36.7803560Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:36.7803870Z             )
2025-05-07T20:32:36.7804059Z         else:
2025-05-07T20:32:36.7804270Z             scale_ub_tensor = None
2025-05-07T20:32:36.7804518Z 
2025-05-07T20:32:36.7804743Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:36.7805065Z             op = silu_mul_quant
2025-05-07T20:32:36.7805317Z             if compiled:
2025-05-07T20:32:36.7805562Z                 op
= torch.compile(op) 2025-05-07T20:32:36.7805861Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7806144Z 2025-05-07T20:32:36.7806328Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.7806496Z 2025-05-07T20:32:36.7806592Z moe/activation_test.py:117: 2025-05-07T20:32:36.7806888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7807229Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.7807508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7808246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.7809112Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.7809670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.7810402Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.7811109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.7811670Z kernel = self.compile( 2025-05-07T20:32:36.7812239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.7812938Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.7813348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7813586Z 2025-05-07T20:32:36.7813796Z self = 2025-05-07T20:32:36.7814967Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.7816477Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd54c78b0>} 2025-05-07T20:32:36.7817947Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.7819052Z context = 2025-05-07T20:32:36.7819356Z 2025-05-07T20:32:36.7819523Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.7820065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.7820556Z module_map=module_map) 2025-05-07T20:32:36.7820924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.7821279Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.7821618Z E ^ 2025-05-07T20:32:36.7822104Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.7822596Z 2025-05-07T20:32:36.7823041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.7823599Z 2025-05-07T20:32:36.7823699Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7824128Z self=, 2025-05-07T20:32:36.7824543Z T=128, 2025-05-07T20:32:36.7824728Z D=5120, 2025-05-07T20:32:36.7824919Z scale_ub=None, 2025-05-07T20:32:36.7825129Z contiguous=False, 2025-05-07T20:32:36.7825360Z compiled=True, 2025-05-07T20:32:36.7825573Z ) 2025-05-07T20:32:36.7825895Z self = 2025-05-07T20:32:36.7826420Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:36.7826728Z 2025-05-07T20:32:36.7826803Z @given( 2025-05-07T20:32:36.7827032Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.7827359Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.7827666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.7828008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.7828343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.7828630Z ) 2025-05-07T20:32:36.7828989Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.7829451Z def test_silu_mul_quant( 2025-05-07T20:32:36.7829802Z self, 2025-05-07T20:32:36.7829998Z T: int, 2025-05-07T20:32:36.7830278Z D: int, 2025-05-07T20:32:36.7830484Z scale_ub: Optional[float], 2025-05-07T20:32:36.7830755Z contiguous: bool, 2025-05-07T20:32:36.7830994Z compiled: bool, 2025-05-07T20:32:36.7831220Z ) -> None: 2025-05-07T20:32:36.7831426Z torch.manual_seed(2025) 2025-05-07T20:32:36.7831670Z 2025-05-07T20:32:36.7831943Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.7832292Z 2025-05-07T20:32:36.7832480Z x_sign = torch.sign(x) 2025-05-07T20:32:36.7832770Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.7833080Z x = x_sign * x_clamp 2025-05-07T20:32:36.7833325Z x0 = x[:, :D] 2025-05-07T20:32:36.7833537Z x1 = x[:, D:] 2025-05-07T20:32:36.7833735Z 2025-05-07T20:32:36.7833914Z if contiguous: 2025-05-07T20:32:36.7834141Z x0 = x0.contiguous() 2025-05-07T20:32:36.7834393Z x1 = x1.contiguous() 2025-05-07T20:32:36.7834631Z 2025-05-07T20:32:36.7834819Z if scale_ub is not None: 2025-05-07T20:32:36.7835087Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.7835425Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.7835742Z ) 2025-05-07T20:32:36.7835928Z else: 2025-05-07T20:32:36.7836126Z scale_ub_tensor = None 2025-05-07T20:32:36.7836379Z 2025-05-07T20:32:36.7836605Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.7836926Z op = silu_mul_quant 2025-05-07T20:32:36.7837173Z if compiled: 2025-05-07T20:32:36.7837414Z op = torch.compile(op) 2025-05-07T20:32:36.7837708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7837987Z 2025-05-07T20:32:36.7838175Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.7838339Z 2025-05-07T20:32:36.7838434Z moe/activation_test.py:117: 2025-05-07T20:32:36.7838731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7839080Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.7839364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7840026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:36.7840619Z return fn(*args, **kwargs) 
2025-05-07T20:32:36.7841318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.7842048Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.7842609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.7843334Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.7844036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.7844594Z kernel = self.compile( 2025-05-07T20:32:36.7845164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.7845861Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.7846269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7846513Z 2025-05-07T20:32:36.7846724Z self = 2025-05-07T20:32:36.7847885Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.7849383Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd510b5e0>} 2025-05-07T20:32:36.7850842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.7852024Z context = 2025-05-07T20:32:36.7852329Z 2025-05-07T20:32:36.7852496Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.7853041Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.7853529Z module_map=module_map) 2025-05-07T20:32:36.7853893Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.7854252Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.7854513Z E ^ 2025-05-07T20:32:36.7855001Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.7855545Z 2025-05-07T20:32:36.7855990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.7856554Z 2025-05-07T20:32:36.7856657Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7857083Z self=, 2025-05-07T20:32:36.7857493Z T=128, 2025-05-07T20:32:36.7857672Z D=7168, 2025-05-07T20:32:36.7857863Z scale_ub=1200.0, 2025-05-07T20:32:36.7858085Z contiguous=False, 2025-05-07T20:32:36.7858318Z compiled=False, 2025-05-07T20:32:36.7858526Z ) 2025-05-07T20:32:36.9403018Z self = 2025-05-07T20:32:36.9403878Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:36.9404288Z 2025-05-07T20:32:36.9404411Z @given( 2025-05-07T20:32:36.9413680Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.9414346Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.9414992Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.9415402Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.9415738Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.9416220Z ) 2025-05-07T20:32:36.9416589Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.9417070Z def test_silu_mul_quant( 2025-05-07T20:32:36.9417310Z self, 2025-05-07T20:32:36.9417506Z T: int, 2025-05-07T20:32:36.9417705Z D: int, 2025-05-07T20:32:36.9417916Z scale_ub: Optional[float], 2025-05-07T20:32:36.9418199Z contiguous: bool, 2025-05-07T20:32:36.9418444Z compiled: bool, 2025-05-07T20:32:36.9418660Z ) -> None: 2025-05-07T20:32:36.9418884Z torch.manual_seed(2025) 2025-05-07T20:32:36.9419125Z 2025-05-07T20:32:36.9419402Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.9419764Z 2025-05-07T20:32:36.9419960Z x_sign = torch.sign(x) 2025-05-07T20:32:36.9420250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.9420563Z x = x_sign * x_clamp 2025-05-07T20:32:36.9420812Z x0 = x[:, :D] 2025-05-07T20:32:36.9421026Z x1 = x[:, D:] 2025-05-07T20:32:36.9421228Z 2025-05-07T20:32:36.9421416Z if contiguous: 2025-05-07T20:32:36.9421643Z x0 = x0.contiguous() 2025-05-07T20:32:36.9421897Z x1 = x1.contiguous() 2025-05-07T20:32:36.9422137Z 2025-05-07T20:32:36.9422328Z if scale_ub is not None: 2025-05-07T20:32:36.9422600Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.9422948Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.9423264Z ) 2025-05-07T20:32:36.9423448Z else: 2025-05-07T20:32:36.9423657Z scale_ub_tensor = None 2025-05-07T20:32:36.9423916Z 2025-05-07T20:32:36.9424139Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.9424595Z op = silu_mul_quant 2025-05-07T20:32:36.9424850Z if compiled: 2025-05-07T20:32:36.9425091Z op = torch.compile(op) 2025-05-07T20:32:36.9425398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.9425682Z 2025-05-07T20:32:36.9425865Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.9426037Z 2025-05-07T20:32:36.9426134Z moe/activation_test.py:117: 2025-05-07T20:32:36.9426436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.9426777Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.9427056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.9427797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.9428539Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.9429098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.9429928Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.9430642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.9431207Z kernel = self.compile( 2025-05-07T20:32:36.9431769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.9432461Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.9432867Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.9433107Z 2025-05-07T20:32:36.9433320Z self = 2025-05-07T20:32:36.9434494Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.9436093Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4add430>} 2025-05-07T20:32:36.9437563Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.9438667Z context = 2025-05-07T20:32:36.9438972Z 2025-05-07T20:32:36.9439141Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.9439687Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.9440175Z module_map=module_map) 2025-05-07T20:32:36.9440545Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.9440903Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.9441170Z E ^ 2025-05-07T20:32:36.9441667Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.9442156Z 2025-05-07T20:32:36.9442603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.9443161Z 2025-05-07T20:32:36.9443260Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.9443682Z self=, 2025-05-07T20:32:36.9444093Z T=128, 2025-05-07T20:32:36.9444268Z D=5120, 2025-05-07T20:32:36.9444454Z scale_ub=None, 2025-05-07T20:32:36.9444669Z contiguous=False, 2025-05-07T20:32:36.9444889Z compiled=False, 2025-05-07T20:32:36.9445092Z ) 2025-05-07T20:32:36.9445415Z self = 2025-05-07T20:32:36.9446015Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:36.9446304Z 2025-05-07T20:32:36.9446379Z @given( 2025-05-07T20:32:36.9446613Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.9446924Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.9447234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.9447568Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.9447904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.9448187Z ) 2025-05-07T20:32:36.9448542Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.9449002Z def test_silu_mul_quant( 2025-05-07T20:32:36.9449240Z self, 2025-05-07T20:32:36.9449430Z T: int, 2025-05-07T20:32:36.9449627Z D: int, 2025-05-07T20:32:36.9449840Z scale_ub: Optional[float], 2025-05-07T20:32:36.9450113Z contiguous: bool, 2025-05-07T20:32:36.9450361Z compiled: bool, 2025-05-07T20:32:36.9450576Z ) -> None: 2025-05-07T20:32:36.9450786Z torch.manual_seed(2025) 2025-05-07T20:32:36.9451027Z 2025-05-07T20:32:36.9451300Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.9451658Z 2025-05-07T20:32:36.9451851Z x_sign = torch.sign(x) 2025-05-07T20:32:36.9452144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.9452455Z x = x_sign * x_clamp 2025-05-07T20:32:36.9452692Z x0 = x[:, :D] 2025-05-07T20:32:36.9452906Z x1 = x[:, D:] 2025-05-07T20:32:36.9453108Z 2025-05-07T20:32:36.9453292Z if contiguous: 2025-05-07T20:32:36.9453528Z x0 = x0.contiguous() 2025-05-07T20:32:36.9453781Z x1 = x1.contiguous() 2025-05-07T20:32:36.9454019Z 2025-05-07T20:32:36.9454209Z if scale_ub is not None: 2025-05-07T20:32:36.9454475Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.9454821Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.9455139Z ) 2025-05-07T20:32:36.9455348Z else: 2025-05-07T20:32:36.9455578Z scale_ub_tensor = None 2025-05-07T20:32:36.9455913Z 2025-05-07T20:32:36.9456139Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.9456461Z op = silu_mul_quant 2025-05-07T20:32:36.9456711Z if compiled: 2025-05-07T20:32:36.9456958Z op = torch.compile(op) 2025-05-07T20:32:36.9457252Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.9457528Z 2025-05-07T20:32:36.9457717Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.9457881Z 2025-05-07T20:32:36.9457976Z moe/activation_test.py:117: 2025-05-07T20:32:36.9458272Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.9458614Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.9458897Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.9459636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.9460381Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.9460942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.9461663Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.9462367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.9467697Z kernel = self.compile( 2025-05-07T20:32:36.9468271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.9468968Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.9469378Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.9469799Z 2025-05-07T20:32:36.9470017Z self = 2025-05-07T20:32:36.9471187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.9472682Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4addca0>} 2025-05-07T20:32:36.9474151Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.9475273Z context = 2025-05-07T20:32:36.9475575Z 2025-05-07T20:32:36.9475744Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.9476296Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.9476784Z module_map=module_map) 2025-05-07T20:32:36.9477150Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.9477505Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.9477764Z E ^ 2025-05-07T20:32:36.9478255Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.9478742Z 2025-05-07T20:32:36.9479189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.9479745Z 2025-05-07T20:32:36.9479846Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.9480264Z self=, 2025-05-07T20:32:36.9480687Z T=128, 2025-05-07T20:32:36.9480872Z D=5120, 2025-05-07T20:32:36.9481059Z scale_ub=1200.0, 2025-05-07T20:32:36.9481281Z contiguous=True, 2025-05-07T20:32:36.9481498Z compiled=False, 2025-05-07T20:32:36.9481697Z ) 2025-05-07T20:32:37.1757665Z self = 2025-05-07T20:32:37.1758430Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.1758830Z 2025-05-07T20:32:37.1758940Z @given( 2025-05-07T20:32:37.1759251Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1759652Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1759972Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1760310Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1760650Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1760947Z ) 2025-05-07T20:32:37.1761307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1761783Z def test_silu_mul_quant( 2025-05-07T20:32:37.1762030Z self, 2025-05-07T20:32:37.1762227Z T: int, 2025-05-07T20:32:37.1762433Z D: int, 2025-05-07T20:32:37.1762663Z scale_ub: Optional[float], 2025-05-07T20:32:37.1762936Z contiguous: bool, 2025-05-07T20:32:37.1763182Z compiled: bool, 2025-05-07T20:32:37.1763414Z ) -> None: 2025-05-07T20:32:37.1763631Z torch.manual_seed(2025) 2025-05-07T20:32:37.1763884Z 2025-05-07T20:32:37.1764163Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1764660Z 2025-05-07T20:32:37.1764858Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1765157Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1765476Z x = x_sign * x_clamp 2025-05-07T20:32:37.1765715Z x0 = x[:, :D] 2025-05-07T20:32:37.1765936Z x1 = x[:, D:] 2025-05-07T20:32:37.1766144Z 2025-05-07T20:32:37.1766407Z if contiguous: 2025-05-07T20:32:37.1766646Z x0 = x0.contiguous() 2025-05-07T20:32:37.1766903Z x1 = x1.contiguous() 2025-05-07T20:32:37.1767144Z 2025-05-07T20:32:37.1767338Z if scale_ub is not None: 2025-05-07T20:32:37.1767605Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1767948Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1768261Z ) 2025-05-07T20:32:37.1768447Z else: 2025-05-07T20:32:37.1768655Z scale_ub_tensor = None 2025-05-07T20:32:37.1768908Z 2025-05-07T20:32:37.1769140Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1769468Z op = silu_mul_quant 2025-05-07T20:32:37.1769728Z if compiled: 2025-05-07T20:32:37.1769983Z op = torch.compile(op) 2025-05-07T20:32:37.1770283Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1770566Z 2025-05-07T20:32:37.1770766Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1770938Z 2025-05-07T20:32:37.1771039Z moe/activation_test.py:117: 2025-05-07T20:32:37.1771343Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1771696Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1772013Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1772750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1773497Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1774066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1774803Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1775509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1776086Z kernel = self.compile( 2025-05-07T20:32:37.1776663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1777455Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1777862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1778105Z 2025-05-07T20:32:37.1778318Z self = 2025-05-07T20:32:37.1779483Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1781003Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd5a4d1f0>} 2025-05-07T20:32:37.1782463Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1783772Z context = 2025-05-07T20:32:37.1784081Z 2025-05-07T20:32:37.1784249Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1784793Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1785280Z module_map=module_map) 2025-05-07T20:32:37.1785653Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1786103Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1786360Z E ^ 2025-05-07T20:32:37.1786847Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1787340Z 2025-05-07T20:32:37.1787785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1788400Z 2025-05-07T20:32:37.1788507Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1788931Z self=, 2025-05-07T20:32:37.1789349Z T=1, 2025-05-07T20:32:37.1789533Z D=7168, 2025-05-07T20:32:37.1789802Z scale_ub=1200.0, 2025-05-07T20:32:37.1790027Z contiguous=True, 2025-05-07T20:32:37.1790255Z compiled=True, 2025-05-07T20:32:37.1790460Z ) 2025-05-07T20:32:37.1790779Z self = 2025-05-07T20:32:37.1791289Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:37.1791561Z 2025-05-07T20:32:37.1791644Z @given( 2025-05-07T20:32:37.1791871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1792191Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1792507Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1792840Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1793177Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1793471Z ) 2025-05-07T20:32:37.1793825Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1794290Z def test_silu_mul_quant( 2025-05-07T20:32:37.1794533Z self, 2025-05-07T20:32:37.1794728Z T: int, 2025-05-07T20:32:37.1794922Z D: int, 2025-05-07T20:32:37.1795158Z scale_ub: Optional[float], 2025-05-07T20:32:37.1795458Z contiguous: bool, 2025-05-07T20:32:37.1795699Z compiled: bool, 2025-05-07T20:32:37.1795920Z ) -> None: 2025-05-07T20:32:37.1796136Z torch.manual_seed(2025) 2025-05-07T20:32:37.1796373Z 2025-05-07T20:32:37.1796645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1797002Z 2025-05-07T20:32:37.1797195Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1797492Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1797810Z x = x_sign * x_clamp 2025-05-07T20:32:37.1798049Z x0 = x[:, :D] 2025-05-07T20:32:37.1798390Z x1 = x[:, D:] 2025-05-07T20:32:37.1798604Z 2025-05-07T20:32:37.1798790Z if contiguous: 2025-05-07T20:32:37.1799019Z x0 = x0.contiguous() 2025-05-07T20:32:37.1799281Z x1 = x1.contiguous() 2025-05-07T20:32:37.1799526Z 2025-05-07T20:32:37.1799715Z if scale_ub is not None: 2025-05-07T20:32:37.1799991Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1800331Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1800641Z ) 2025-05-07T20:32:37.1800829Z else: 2025-05-07T20:32:37.1801039Z scale_ub_tensor = None 2025-05-07T20:32:37.1801288Z 2025-05-07T20:32:37.1801519Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1801846Z op = silu_mul_quant 2025-05-07T20:32:37.1802092Z if compiled: 2025-05-07T20:32:37.1802338Z op = torch.compile(op) 2025-05-07T20:32:37.1802648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1802925Z 2025-05-07T20:32:37.1803115Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1803280Z 2025-05-07T20:32:37.1803382Z moe/activation_test.py:117: 2025-05-07T20:32:37.1803681Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1804019Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1804306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1804952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1805548Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1806258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1807070Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1807633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1808361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1809065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1809630Z kernel = self.compile( 2025-05-07T20:32:37.1810194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1810891Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1811303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1811543Z 2025-05-07T20:32:37.1811763Z self = 2025-05-07T20:32:37.1812932Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1814430Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4723820>} 2025-05-07T20:32:37.1815893Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1816996Z context = 2025-05-07T20:32:37.1817300Z 2025-05-07T20:32:37.1817475Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1818015Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1818507Z module_map=module_map) 2025-05-07T20:32:37.1818886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1819323Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1819585Z E ^ 2025-05-07T20:32:37.1820076Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1820565Z 2025-05-07T20:32:37.1821015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1821571Z 2025-05-07T20:32:37.1821673Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1822097Z self=, 2025-05-07T20:32:37.1822515Z T=1, 2025-05-07T20:32:37.1822689Z D=7168, 2025-05-07T20:32:37.1822878Z scale_ub=1200.0, 2025-05-07T20:32:37.1823100Z contiguous=False, 2025-05-07T20:32:37.1823321Z compiled=True, 2025-05-07T20:32:37.1823521Z ) 2025-05-07T20:32:37.3467806Z self = 2025-05-07T20:32:37.3468622Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.3469007Z 2025-05-07T20:32:37.3469111Z @given( 2025-05-07T20:32:37.3469434Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.3469900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.3470215Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.3470550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.3471010Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.3471308Z ) 2025-05-07T20:32:37.3471667Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.3472127Z def test_silu_mul_quant( 2025-05-07T20:32:37.3472370Z self, 2025-05-07T20:32:37.3472622Z T: int, 2025-05-07T20:32:37.3472816Z D: int, 2025-05-07T20:32:37.3473030Z scale_ub: Optional[float], 2025-05-07T20:32:37.3473296Z contiguous: bool, 2025-05-07T20:32:37.3473538Z compiled: bool, 2025-05-07T20:32:37.3473758Z ) -> None: 2025-05-07T20:32:37.3473963Z torch.manual_seed(2025) 2025-05-07T20:32:37.3474205Z 2025-05-07T20:32:37.3474477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.3474830Z 2025-05-07T20:32:37.3475014Z x_sign = torch.sign(x) 2025-05-07T20:32:37.3475304Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.3475624Z x = x_sign * x_clamp 2025-05-07T20:32:37.3475857Z x0 = x[:, :D] 2025-05-07T20:32:37.3476074Z x1 = x[:, D:] 2025-05-07T20:32:37.3476279Z 2025-05-07T20:32:37.3476456Z if contiguous: 2025-05-07T20:32:37.3476687Z x0 = x0.contiguous() 2025-05-07T20:32:37.3476945Z x1 = x1.contiguous() 2025-05-07T20:32:37.3477188Z 2025-05-07T20:32:37.3477375Z if scale_ub is not None: 2025-05-07T20:32:37.3477646Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.3477987Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.3478302Z ) 2025-05-07T20:32:37.3478496Z else: 2025-05-07T20:32:37.3478697Z scale_ub_tensor = None 2025-05-07T20:32:37.3478951Z 2025-05-07T20:32:37.3479180Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.3479500Z op = silu_mul_quant 2025-05-07T20:32:37.3479744Z if compiled: 2025-05-07T20:32:37.3479991Z op = torch.compile(op) 2025-05-07T20:32:37.3480292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.3480566Z 2025-05-07T20:32:37.3480753Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.3480918Z 2025-05-07T20:32:37.3481023Z moe/activation_test.py:117: 2025-05-07T20:32:37.3481316Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.3481658Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.3481942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.3482663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.3483425Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.3484137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.3484877Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.3485470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.3492019Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.3492741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.3493326Z kernel = self.compile( 2025-05-07T20:32:37.3493900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.3494606Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.3495025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.3495287Z 2025-05-07T20:32:37.3495535Z self = 2025-05-07T20:32:37.3496696Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.3498311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd46a64c0>} 2025-05-07T20:32:37.3499835Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.3500948Z context = 2025-05-07T20:32:37.3501248Z 2025-05-07T20:32:37.3501423Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.3501964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.3502455Z module_map=module_map) 2025-05-07T20:32:37.3502839Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.3503195Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.3503465Z E ^ 2025-05-07T20:32:37.3503958Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.3504448Z 2025-05-07T20:32:37.3504901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.3505457Z 2025-05-07T20:32:37.3505563Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.3505989Z self=, 2025-05-07T20:32:37.3506409Z T=1, 2025-05-07T20:32:37.3506585Z D=7168, 2025-05-07T20:32:37.3506775Z scale_ub=None, 2025-05-07T20:32:37.3506995Z contiguous=False, 2025-05-07T20:32:37.3507213Z compiled=True, 2025-05-07T20:32:37.3507418Z ) 2025-05-07T20:32:37.4646400Z self = 2025-05-07T20:32:37.4647195Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.4647587Z 2025-05-07T20:32:37.4647694Z @given( 2025-05-07T20:32:37.4648005Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4648430Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4648742Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4649082Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4649578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4649868Z ) 2025-05-07T20:32:37.4650232Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4650704Z def test_silu_mul_quant( 2025-05-07T20:32:37.4650948Z self, 2025-05-07T20:32:37.4651147Z T: int, 2025-05-07T20:32:37.4651344Z D: int, 2025-05-07T20:32:37.4651558Z scale_ub: Optional[float], 2025-05-07T20:32:37.4651838Z contiguous: bool, 2025-05-07T20:32:37.4652083Z compiled: bool, 2025-05-07T20:32:37.4652307Z ) -> None: 2025-05-07T20:32:37.4652527Z torch.manual_seed(2025) 2025-05-07T20:32:37.4652777Z 2025-05-07T20:32:37.4653047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4653414Z 2025-05-07T20:32:37.4653616Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4653914Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4654227Z x = x_sign * x_clamp 2025-05-07T20:32:37.4654481Z x0 = x[:, :D] 2025-05-07T20:32:37.4654698Z x1 = x[:, D:] 2025-05-07T20:32:37.4654905Z 2025-05-07T20:32:37.4655092Z if contiguous: 2025-05-07T20:32:37.4655326Z x0 = x0.contiguous() 2025-05-07T20:32:37.4655587Z x1 = x1.contiguous() 2025-05-07T20:32:37.4655833Z 2025-05-07T20:32:37.4656027Z if scale_ub is not None: 2025-05-07T20:32:37.4656373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4658250Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4658567Z ) 2025-05-07T20:32:37.4658753Z else: 2025-05-07T20:32:37.4658960Z scale_ub_tensor = None 2025-05-07T20:32:37.4659220Z 2025-05-07T20:32:37.4659444Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4660443Z op = silu_mul_quant 2025-05-07T20:32:37.4660702Z if compiled: 2025-05-07T20:32:37.4660955Z op = torch.compile(op) 2025-05-07T20:32:37.4661260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4661554Z 2025-05-07T20:32:37.4661750Z y_fp8, y_scale = fn() 2025-05-07T20:32:37.4662035Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:37.4662340Z 2025-05-07T20:32:37.4662577Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4662920Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:37.4663233Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:37.4663559Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:37.4663927Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.4664255Z 2025-05-07T20:32:37.4664458Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:37.4664662Z 2025-05-07T20:32:37.4664768Z moe/activation_test.py:126: 2025-05-07T20:32:37.4665088Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4665439Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:37.4665774Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.4666620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:37.4667424Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.4667999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.4668736Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4669471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:37.4670385Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4671279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:37.4672084Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4672860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:37.4673539Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.4674171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:37.4674725Z fn() 2025-05-07T20:32:37.4675250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:37.4675873Z self.fn.run( 2025-05-07T20:32:37.4676363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.4676928Z kernel = self.compile( 2025-05-07T20:32:37.4677507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.4678208Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4678627Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4678871Z 2025-05-07T20:32:37.4679087Z self = 2025-05-07T20:32:37.4680263Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.4681848Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f3dd4636040>} 2025-05-07T20:32:37.4683627Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.4684739Z context = 2025-05-07T20:32:37.4685039Z 2025-05-07T20:32:37.4685208Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.4685752Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4686245Z module_map=module_map) 2025-05-07T20:32:37.4686612Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4686978Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.4687245Z E ^ 2025-05-07T20:32:37.4687731Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4688220Z 2025-05-07T20:32:37.4688670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4689226Z 2025-05-07T20:32:37.4689327Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4689754Z self=, 2025-05-07T20:32:37.4690160Z T=1, 2025-05-07T20:32:37.4690336Z D=5120, 2025-05-07T20:32:37.4690527Z scale_ub=1200.0, 2025-05-07T20:32:37.4690739Z contiguous=False, 2025-05-07T20:32:37.4690964Z compiled=True, 2025-05-07T20:32:37.4691163Z ) 2025-05-07T20:32:37.8485189Z self = 2025-05-07T20:32:37.8485939Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.8486230Z 2025-05-07T20:32:37.8486317Z @given( 2025-05-07T20:32:37.8486564Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.8486887Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.8487207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.8487796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.8488141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.8488443Z ) 2025-05-07T20:32:37.8488815Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.8489291Z def test_silu_mul_quant( 2025-05-07T20:32:37.8489538Z self, 2025-05-07T20:32:37.8489745Z T: int, 2025-05-07T20:32:37.8489960Z D: int, 2025-05-07T20:32:37.8490182Z scale_ub: Optional[float], 2025-05-07T20:32:37.8490464Z contiguous: bool, 2025-05-07T20:32:37.8490719Z compiled: bool, 2025-05-07T20:32:37.8490947Z ) -> None: 2025-05-07T20:32:37.8491167Z torch.manual_seed(2025) 2025-05-07T20:32:37.8491416Z 2025-05-07T20:32:37.8491694Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.8492053Z 2025-05-07T20:32:37.8492249Z x_sign = torch.sign(x) 2025-05-07T20:32:37.8492551Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.8492882Z x = x_sign * x_clamp 2025-05-07T20:32:37.8493130Z x0 = x[:, :D] 2025-05-07T20:32:37.8493355Z x1 = x[:, D:] 2025-05-07T20:32:37.8493570Z 2025-05-07T20:32:37.8493769Z if contiguous: 2025-05-07T20:32:37.8494004Z x0 = x0.contiguous() 2025-05-07T20:32:37.8494275Z x1 = x1.contiguous() 2025-05-07T20:32:37.8494618Z 2025-05-07T20:32:37.8494811Z if scale_ub is not None: 2025-05-07T20:32:37.8495095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.8495444Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.8495824Z ) 2025-05-07T20:32:37.8496020Z else: 2025-05-07T20:32:37.8496284Z scale_ub_tensor = None 2025-05-07T20:32:37.8496706Z 2025-05-07T20:32:37.8496945Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.8497273Z op = silu_mul_quant 2025-05-07T20:32:37.8497536Z if compiled: 
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f3dd4636f70>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f3dd46d1700>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
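Every sampled example dies in the same place: Triton's front end rejects the fp8e4nv (e4m3) element type while building IR for _fbgemm_silu_mul_quant, before the kernel ever launches, and the supported set it reports, ('fp8e4b15', 'fp8e5'), is characteristic of pre-Ada NVIDIA GPUs. A minimal guard along the following lines would let such runners skip the test instead of erroring; the sm_89 threshold is an assumption (some Triton releases gate fp8e4nv at sm_90), and the decorator usage is illustrative, not the test file's actual code:

import unittest

import torch


def fp8e4nv_supported() -> bool:
    """Best-effort check that the current GPU can lower Triton's fp8e4nv."""
    if not torch.cuda.is_available():
        return False
    # Assumption: e4m3 ("fp8e4nv") needs compute capability sm_89 (Ada) or
    # newer; older parts expose only fp8e4b15/fp8e5, matching the error above.
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the test above:
# @unittest.skipUnless(fp8e4nv_supported(), "fp8e4nv requires sm_89+")
# def test_silu_mul_quant(self, ...) -> None: ...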
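For context, silu_mul_quant fuses a SiLU-gated multiply with quantization to fp8, returning the quantized tensor together with its scale, which is what the test's y_fp8, y_scale unpacking expects. A rough eager-mode sketch of those semantics, assuming row-wise dynamic scaling and an e4m3 maximum of 448 (silu_mul_quant_ref and its scaling scheme are illustrative guesses, not FBGEMM's actual kernel):

from typing import Optional, Tuple

import torch
import torch.nn.functional as F

FP8_E4M3_MAX = 448.0  # assumed max magnitude of torch.float8_e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # silu(x0) * x1, computed in fp32 for accuracy
    y = F.silu(x0.float()) * x1.float()
    # Assumed per-row dynamic scale, optionally clamped by scale_ub.
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub.float())
    scale = amax / FP8_E4M3_MAX
    y_fp8 = (y / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale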
Hypothesis went on to try further examples; each ran the same test body and failed with the identical CompilationError from _fbgemm_silu_mul_quant:

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same traceback through triton/compiler/compiler.py:273 (make_ir), ending:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.2470386Z 2025-05-07T20:32:39.2470831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.2471389Z 2025-05-07T20:32:39.2471491Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.2471915Z self=, 2025-05-07T20:32:39.2472338Z T=1, 2025-05-07T20:32:39.2472520Z D=7168, 2025-05-07T20:32:39.2472718Z scale_ub=None, 2025-05-07T20:32:39.2472939Z contiguous=False, 2025-05-07T20:32:39.2473163Z compiled=False, 2025-05-07T20:32:39.2473375Z ) 2025-05-07T20:32:39.2473696Z self = 2025-05-07T20:32:39.2474206Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:39.2474485Z 2025-05-07T20:32:39.2474566Z @given( 2025-05-07T20:32:39.2474798Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.2475126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.2475438Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.2475785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.2476175Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.2476470Z ) 2025-05-07T20:32:39.2476833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.2477414Z def test_silu_mul_quant( 2025-05-07T20:32:39.2477653Z self, 2025-05-07T20:32:39.2477848Z T: int, 2025-05-07T20:32:39.2478044Z D: int, 2025-05-07T20:32:39.2478256Z scale_ub: Optional[float], 2025-05-07T20:32:39.2478526Z contiguous: bool, 2025-05-07T20:32:39.2478766Z compiled: bool, 2025-05-07T20:32:39.2479007Z ) -> None: 2025-05-07T20:32:39.2479216Z torch.manual_seed(2025) 2025-05-07T20:32:39.2479462Z 2025-05-07T20:32:39.2479735Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.2480081Z 2025-05-07T20:32:39.2480267Z x_sign = torch.sign(x) 2025-05-07T20:32:39.2480556Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.2480870Z x = x_sign * x_clamp 2025-05-07T20:32:39.2481108Z x0 = x[:, :D] 2025-05-07T20:32:39.2481324Z x1 = x[:, D:] 2025-05-07T20:32:39.2481525Z 2025-05-07T20:32:39.2481702Z if contiguous: 2025-05-07T20:32:39.2481937Z x0 = x0.contiguous() 2025-05-07T20:32:39.2482194Z x1 = x1.contiguous() 2025-05-07T20:32:39.2482432Z 2025-05-07T20:32:39.2482624Z if scale_ub is not None: 2025-05-07T20:32:39.2483053Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.2483493Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.2483809Z ) 2025-05-07T20:32:39.2483997Z else: 2025-05-07T20:32:39.2484283Z scale_ub_tensor = None 2025-05-07T20:32:39.2484555Z 2025-05-07T20:32:39.2484793Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.2485135Z op = silu_mul_quant 2025-05-07T20:32:39.2485401Z if compiled: 2025-05-07T20:32:39.2485661Z op = torch.compile(op) 2025-05-07T20:32:39.2486048Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2486344Z 2025-05-07T20:32:39.2486539Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.2486718Z 2025-05-07T20:32:39.2486827Z moe/activation_test.py:117: 2025-05-07T20:32:39.2487147Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2487521Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.2487825Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2488650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.2489493Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.2490126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.2490950Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.2491741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.2492377Z kernel = self.compile( 2025-05-07T20:32:39.2493017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.2493799Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.2494261Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2494533Z 2025-05-07T20:32:39.2494766Z self = 2025-05-07T20:32:39.2495971Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.2497471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4151700>} 2025-05-07T20:32:39.2499058Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.2500162Z context = 2025-05-07T20:32:39.2500467Z 2025-05-07T20:32:39.2500646Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.2501185Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.2501677Z module_map=module_map) 2025-05-07T20:32:39.2502050Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.2502403Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.2502664Z E ^ 2025-05-07T20:32:39.2503152Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.2503638Z 2025-05-07T20:32:39.2504093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.2504645Z 2025-05-07T20:32:39.2504746Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.2505172Z self=, 2025-05-07T20:32:39.2505590Z T=2048, 2025-05-07T20:32:39.2505794Z D=7168, 2025-05-07T20:32:39.2506003Z scale_ub=None, 2025-05-07T20:32:39.2506215Z contiguous=False, 2025-05-07T20:32:39.2506481Z compiled=True, 2025-05-07T20:32:39.2506682Z ) 2025-05-07T20:32:39.3683553Z self = 2025-05-07T20:32:39.3684147Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:39.3684563Z 2025-05-07T20:32:39.3684676Z @given( 2025-05-07T20:32:39.3685149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.3685570Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.3685909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.3686268Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.3686602Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.3686890Z ) 2025-05-07T20:32:39.3687244Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.3687706Z def test_silu_mul_quant( 2025-05-07T20:32:39.3687953Z self, 2025-05-07T20:32:39.3688138Z T: int, 2025-05-07T20:32:39.3688336Z D: int, 2025-05-07T20:32:39.3688549Z scale_ub: Optional[float], 2025-05-07T20:32:39.3688819Z contiguous: bool, 2025-05-07T20:32:39.3689052Z compiled: bool, 2025-05-07T20:32:39.3689272Z ) -> None: 2025-05-07T20:32:39.3689482Z torch.manual_seed(2025) 2025-05-07T20:32:39.3689752Z 2025-05-07T20:32:39.3690017Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.3690371Z 2025-05-07T20:32:39.3690557Z x_sign = torch.sign(x) 2025-05-07T20:32:39.3690852Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.3691160Z x = x_sign * x_clamp 2025-05-07T20:32:39.3691397Z x0 = x[:, :D] 2025-05-07T20:32:39.3691609Z x1 = x[:, D:] 2025-05-07T20:32:39.3691809Z 2025-05-07T20:32:39.3691990Z if contiguous: 2025-05-07T20:32:39.3692216Z x0 = x0.contiguous() 2025-05-07T20:32:39.3692472Z x1 = x1.contiguous() 2025-05-07T20:32:39.3692711Z 2025-05-07T20:32:39.3692899Z if scale_ub is not None: 2025-05-07T20:32:39.3693169Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.3693509Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.3693828Z ) 2025-05-07T20:32:39.3694006Z else: 2025-05-07T20:32:39.3694210Z scale_ub_tensor = None 2025-05-07T20:32:39.3694460Z 2025-05-07T20:32:39.3694683Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.3695002Z op = silu_mul_quant 2025-05-07T20:32:39.3695954Z if compiled: 2025-05-07T20:32:39.3696210Z op = torch.compile(op) 2025-05-07T20:32:39.3696507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3696790Z 2025-05-07T20:32:39.3696975Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.3697144Z 2025-05-07T20:32:39.3697244Z moe/activation_test.py:117: 2025-05-07T20:32:39.3697545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3697887Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.3698165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3698754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.3699353Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.3700056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.3700799Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.3701360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.3702091Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.3702791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.3703423Z kernel = self.compile( 2025-05-07T20:32:39.3703991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.3704686Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.3705091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3705382Z 2025-05-07T20:32:39.3705595Z self = 2025-05-07T20:32:39.3706821Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.3708373Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd41de3a0>} 2025-05-07T20:32:39.3709952Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.3711053Z context = 2025-05-07T20:32:39.3711359Z 2025-05-07T20:32:39.3711529Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.3712076Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.3712563Z module_map=module_map) 2025-05-07T20:32:39.3712929Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.3713286Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.3713542Z E ^ 2025-05-07T20:32:39.3714023Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.3714515Z 2025-05-07T20:32:39.3714965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.3715524Z 2025-05-07T20:32:39.3715624Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.3716045Z self=, 2025-05-07T20:32:39.3716458Z T=4096, 2025-05-07T20:32:39.3716641Z D=7168, 2025-05-07T20:32:39.3716828Z scale_ub=None, 2025-05-07T20:32:39.3717036Z contiguous=False, 2025-05-07T20:32:39.3717256Z compiled=True, 2025-05-07T20:32:39.3717542Z ) 2025-05-07T20:32:39.3717861Z self = 2025-05-07T20:32:39.3718376Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:39.3718667Z 2025-05-07T20:32:39.3718740Z @given( 2025-05-07T20:32:39.3718966Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.3719282Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.3719600Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.3719939Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.3720268Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.3720560Z ) 2025-05-07T20:32:39.3720918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.3721377Z def test_silu_mul_quant( 2025-05-07T20:32:39.3721612Z self, 2025-05-07T20:32:39.3721800Z T: int, 2025-05-07T20:32:39.3721989Z D: int, 2025-05-07T20:32:39.3722210Z scale_ub: Optional[float], 2025-05-07T20:32:39.3722481Z contiguous: bool, 2025-05-07T20:32:39.3722717Z compiled: bool, 2025-05-07T20:32:39.3722930Z ) -> None: 2025-05-07T20:32:39.3723142Z torch.manual_seed(2025) 2025-05-07T20:32:39.3723389Z 2025-05-07T20:32:39.3723658Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.3724057Z 2025-05-07T20:32:39.3724245Z x_sign = torch.sign(x) 2025-05-07T20:32:39.3724527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.3724838Z x = x_sign * x_clamp 2025-05-07T20:32:39.3725075Z x0 = x[:, :D] 2025-05-07T20:32:39.3725285Z x1 = x[:, D:] 2025-05-07T20:32:39.3725680Z 2025-05-07T20:32:39.3725912Z if contiguous: 2025-05-07T20:32:39.3726135Z x0 = x0.contiguous() 2025-05-07T20:32:39.3726391Z x1 = x1.contiguous() 2025-05-07T20:32:39.3726632Z 2025-05-07T20:32:39.3726823Z if scale_ub is not None: 2025-05-07T20:32:39.3727094Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.3727433Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.3727748Z ) 2025-05-07T20:32:39.3727932Z else: 2025-05-07T20:32:39.3728135Z scale_ub_tensor = None 2025-05-07T20:32:39.3728388Z 2025-05-07T20:32:39.3728610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.3728932Z op = silu_mul_quant 2025-05-07T20:32:39.3729178Z if compiled: 2025-05-07T20:32:39.3729420Z op = torch.compile(op) 2025-05-07T20:32:39.3729717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3729995Z 2025-05-07T20:32:39.3730176Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.3730348Z 2025-05-07T20:32:39.3730444Z moe/activation_test.py:117: 2025-05-07T20:32:39.3730742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3731085Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.3731366Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3731950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.3732543Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.3733241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.3733982Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.3734547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.3735268Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.3736024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.3736588Z kernel = self.compile( 2025-05-07T20:32:39.3737243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.3737937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.3738344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3738586Z 2025-05-07T20:32:39.3738805Z self = 2025-05-07T20:32:39.3739969Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.3741465Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd41de700>} 2025-05-07T20:32:39.3742933Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.3744040Z context = 2025-05-07T20:32:39.3744344Z 2025-05-07T20:32:39.3744516Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.3745058Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.3745605Z module_map=module_map) 2025-05-07T20:32:39.3745978Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.3746337Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.3746593Z E ^ 2025-05-07T20:32:39.3747120Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.3747603Z 2025-05-07T20:32:39.3748065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.3748619Z 2025-05-07T20:32:39.7652173Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.7652685Z self=, 2025-05-07T20:32:39.7653184Z T=16384, 2025-05-07T20:32:39.7660093Z D=5120, 2025-05-07T20:32:39.7660310Z scale_ub=1200.0, 2025-05-07T20:32:39.7660587Z contiguous=False, 2025-05-07T20:32:39.7660822Z compiled=False, 2025-05-07T20:32:39.7661023Z ) 2025-05-07T20:32:39.7661355Z self = 2025-05-07T20:32:39.7661889Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:39.7662194Z 2025-05-07T20:32:39.7662269Z @given( 2025-05-07T20:32:39.7662498Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.7662819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.7663141Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.7663473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.7663809Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.7664105Z ) 2025-05-07T20:32:39.7664456Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.7664918Z def test_silu_mul_quant( 2025-05-07T20:32:39.7665160Z self, 2025-05-07T20:32:39.7665342Z T: int, 2025-05-07T20:32:39.7665536Z D: int, 2025-05-07T20:32:39.7665750Z scale_ub: Optional[float], 2025-05-07T20:32:39.7666014Z contiguous: bool, 2025-05-07T20:32:39.7666261Z compiled: bool, 2025-05-07T20:32:39.7666485Z ) -> None: 2025-05-07T20:32:39.7666698Z torch.manual_seed(2025) 2025-05-07T20:32:39.7666942Z 2025-05-07T20:32:39.7667217Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.7667571Z 2025-05-07T20:32:39.7667928Z x_sign = torch.sign(x) 2025-05-07T20:32:39.7668224Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.7668537Z x = x_sign * x_clamp 2025-05-07T20:32:39.7668773Z x0 = x[:, :D] 2025-05-07T20:32:39.7668987Z x1 = x[:, D:] 2025-05-07T20:32:39.7669203Z 2025-05-07T20:32:39.7669380Z if contiguous: 2025-05-07T20:32:39.7669608Z x0 = x0.contiguous() 2025-05-07T20:32:39.7669969Z x1 = x1.contiguous() 2025-05-07T20:32:39.7670209Z 2025-05-07T20:32:39.7670394Z if scale_ub is not None: 2025-05-07T20:32:39.7670659Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.7670996Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.7671305Z ) 2025-05-07T20:32:39.7671486Z else: 2025-05-07T20:32:39.7671692Z scale_ub_tensor = None 2025-05-07T20:32:39.7671946Z 2025-05-07T20:32:39.7672166Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.7672494Z op = silu_mul_quant 2025-05-07T20:32:39.7672745Z if compiled: 2025-05-07T20:32:39.7672987Z op = torch.compile(op) 2025-05-07T20:32:39.7673283Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7673562Z 2025-05-07T20:32:39.7673745Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.7673907Z 2025-05-07T20:32:39.7674002Z moe/activation_test.py:117: 2025-05-07T20:32:39.7674378Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7674720Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.7674998Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7675738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:39.7676541Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.7677108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.7677830Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.7678532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.7679091Z kernel = self.compile( 2025-05-07T20:32:39.7679657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.7680355Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.7680765Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7681004Z 2025-05-07T20:32:39.7681219Z self = 2025-05-07T20:32:39.7682383Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.7684080Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3e83790>} 2025-05-07T20:32:39.7685537Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.7686642Z context = 2025-05-07T20:32:39.7686943Z 2025-05-07T20:32:39.7687111Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.7687651Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.7688138Z module_map=module_map) 2025-05-07T20:32:39.7688514Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.7688988Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.7689256Z E ^ 2025-05-07T20:32:39.7689747Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.7690230Z 2025-05-07T20:32:39.7690678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.7691233Z 2025-05-07T20:32:39.7691334Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.7691754Z self=, 2025-05-07T20:32:39.7692171Z T=16384, 2025-05-07T20:32:39.7692351Z D=5120, 2025-05-07T20:32:39.7692537Z scale_ub=1200.0, 2025-05-07T20:32:39.7692752Z contiguous=True, 2025-05-07T20:32:39.7692960Z compiled=True, 2025-05-07T20:32:39.7693156Z ) 2025-05-07T20:32:39.7693475Z self = 2025-05-07T20:32:39.7694001Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:39.7694292Z 2025-05-07T20:32:39.7694370Z @given( 2025-05-07T20:32:39.7694591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.7694906Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.7695211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.7695542Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.7695946Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.7696231Z ) 2025-05-07T20:32:39.7696584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.7697038Z def test_silu_mul_quant( 2025-05-07T20:32:39.7697273Z self, 2025-05-07T20:32:39.7697529Z T: int, 2025-05-07T20:32:39.7697724Z D: int, 2025-05-07T20:32:39.7697938Z scale_ub: Optional[float], 2025-05-07T20:32:39.7698205Z contiguous: bool, 2025-05-07T20:32:39.7698446Z compiled: bool, 2025-05-07T20:32:39.7698664Z ) -> None: 2025-05-07T20:32:39.7698871Z torch.manual_seed(2025) 2025-05-07T20:32:39.7699111Z 2025-05-07T20:32:39.7699377Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.7699724Z 2025-05-07T20:32:39.7699909Z x_sign = torch.sign(x) 2025-05-07T20:32:39.7700196Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.7700511Z x = x_sign * x_clamp 2025-05-07T20:32:39.7700750Z x0 = x[:, :D] 2025-05-07T20:32:39.7700963Z x1 = x[:, D:] 2025-05-07T20:32:39.7701163Z 2025-05-07T20:32:39.7701345Z if contiguous: 2025-05-07T20:32:39.7701575Z x0 = x0.contiguous() 2025-05-07T20:32:39.7701832Z x1 = x1.contiguous() 2025-05-07T20:32:39.7702067Z 2025-05-07T20:32:39.7702251Z if scale_ub is not None: 2025-05-07T20:32:39.7702517Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.7702858Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.7703168Z ) 2025-05-07T20:32:39.7703353Z else: 2025-05-07T20:32:39.7703551Z scale_ub_tensor = None 2025-05-07T20:32:39.7703807Z 2025-05-07T20:32:39.7704033Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.7704346Z op = silu_mul_quant 2025-05-07T20:32:39.7704593Z if compiled: 2025-05-07T20:32:39.7704839Z op = torch.compile(op) 2025-05-07T20:32:39.7705134Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7705413Z 2025-05-07T20:32:39.7705602Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.7705775Z 2025-05-07T20:32:39.7705884Z moe/activation_test.py:117: 2025-05-07T20:32:39.7706207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7706550Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.7706829Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7707496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.7708092Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.7708790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.7709522Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.7710160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.7710886Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.7711584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.7712144Z kernel = self.compile( 2025-05-07T20:32:39.7712708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.7713404Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.7713805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7714049Z 2025-05-07T20:32:39.7714258Z self = 2025-05-07T20:32:39.7715416Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.7716963Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3d91550>} 2025-05-07T20:32:39.7718500Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.7719599Z context = 2025-05-07T20:32:39.7719911Z 2025-05-07T20:32:39.7720078Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.7720622Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.7721105Z module_map=module_map) 2025-05-07T20:32:39.7721472Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.7721830Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.7722086Z E ^ 2025-05-07T20:32:39.7722564Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.7723053Z 2025-05-07T20:32:39.7723495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.7724050Z 2025-05-07T20:32:39.9951083Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.9951516Z self=, 2025-05-07T20:32:39.9951977Z T=16384, 2025-05-07T20:32:39.9952193Z D=5120, 2025-05-07T20:32:39.9952461Z scale_ub=None, 2025-05-07T20:32:39.9952722Z contiguous=False, 2025-05-07T20:32:39.9952953Z compiled=True, 2025-05-07T20:32:39.9953159Z ) 2025-05-07T20:32:39.9953496Z self = 2025-05-07T20:32:39.9954020Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:39.9954324Z 2025-05-07T20:32:39.9954400Z @given( 2025-05-07T20:32:39.9954634Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.9954960Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.9955281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.9955630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.9956133Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.9956433Z ) 2025-05-07T20:32:39.9956791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.9957254Z def test_silu_mul_quant( 2025-05-07T20:32:39.9957489Z self, 2025-05-07T20:32:39.9957686Z T: int, 2025-05-07T20:32:39.9957875Z D: int, 2025-05-07T20:32:39.9958096Z scale_ub: Optional[float], 2025-05-07T20:32:39.9958367Z contiguous: bool, 2025-05-07T20:32:39.9958599Z compiled: bool, 2025-05-07T20:32:39.9958822Z ) -> None: 2025-05-07T20:32:39.9959035Z torch.manual_seed(2025) 2025-05-07T20:32:39.9959270Z 2025-05-07T20:32:39.9959536Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.9959889Z 2025-05-07T20:32:39.9960071Z x_sign = torch.sign(x) 2025-05-07T20:32:39.9960363Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.9960681Z x = x_sign * x_clamp 2025-05-07T20:32:39.9960917Z x0 = x[:, :D] 2025-05-07T20:32:39.9961132Z x1 = x[:, D:] 2025-05-07T20:32:39.9961333Z 2025-05-07T20:32:39.9961512Z if contiguous: 2025-05-07T20:32:39.9961750Z x0 = x0.contiguous() 2025-05-07T20:32:39.9962008Z x1 = x1.contiguous() 2025-05-07T20:32:39.9962240Z 2025-05-07T20:32:39.9962426Z if scale_ub is not None: 2025-05-07T20:32:39.9962766Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.9963103Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.9963412Z ) 2025-05-07T20:32:39.9963592Z else: 2025-05-07T20:32:39.9963801Z scale_ub_tensor = None 2025-05-07T20:32:39.9964045Z 2025-05-07T20:32:39.9964338Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.9964660Z op = silu_mul_quant 2025-05-07T20:32:39.9964905Z if compiled: 2025-05-07T20:32:39.9965157Z op = torch.compile(op) 2025-05-07T20:32:39.9965452Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9965723Z 2025-05-07T20:32:39.9965911Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.9966076Z 2025-05-07T20:32:39.9966177Z moe/activation_test.py:117: 2025-05-07T20:32:39.9966469Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9966814Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.9967095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9967679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.9968266Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.9968969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.9969711Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.9970272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.9970999Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.9971706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.9972269Z kernel = self.compile( 2025-05-07T20:32:39.9972833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.9973541Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.9973951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9974190Z 2025-05-07T20:32:39.9974411Z self = 2025-05-07T20:32:39.9975655Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.9977161Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3d910d0>} 2025-05-07T20:32:39.9978635Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.9979745Z context = 2025-05-07T20:32:39.9980048Z 2025-05-07T20:32:39.9980221Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.9980765Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.9981250Z module_map=module_map) 2025-05-07T20:32:39.9981631Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.9981984Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.9982252Z E ^ 2025-05-07T20:32:39.9982922Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.9983415Z 2025-05-07T20:32:39.9983865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.9984492Z 2025-05-07T20:32:39.9984594Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.9985021Z self=, 2025-05-07T20:32:39.9985438Z T=2048, 2025-05-07T20:32:39.9985619Z D=5120, 2025-05-07T20:32:39.9985892Z scale_ub=None, 2025-05-07T20:32:39.9986126Z contiguous=False, 2025-05-07T20:32:39.9986346Z compiled=True, 2025-05-07T20:32:39.9986546Z ) 2025-05-07T20:32:40.1197903Z self = 2025-05-07T20:32:40.1198430Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.1198740Z 2025-05-07T20:32:40.1198825Z @given( 2025-05-07T20:32:40.1199140Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.1199529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.1199844Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.1200178Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.1200513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.1200804Z ) 2025-05-07T20:32:40.1201166Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.1201622Z def test_silu_mul_quant( 2025-05-07T20:32:40.1201865Z self, 2025-05-07T20:32:40.1202051Z T: int, 2025-05-07T20:32:40.1202237Z D: int, 2025-05-07T20:32:40.1202448Z scale_ub: Optional[float], 2025-05-07T20:32:40.1202722Z contiguous: bool, 2025-05-07T20:32:40.1202954Z compiled: bool, 2025-05-07T20:32:40.1203173Z ) -> None: 2025-05-07T20:32:40.1203383Z torch.manual_seed(2025) 2025-05-07T20:32:40.1203615Z 2025-05-07T20:32:40.1203887Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.1204239Z 2025-05-07T20:32:40.1204422Z x_sign = torch.sign(x) 2025-05-07T20:32:40.1204716Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.1205030Z x = x_sign * x_clamp 2025-05-07T20:32:40.1205266Z x0 = x[:, :D] 2025-05-07T20:32:40.1205482Z x1 = x[:, D:] 2025-05-07T20:32:40.1205688Z 2025-05-07T20:32:40.1205884Z if contiguous: 2025-05-07T20:32:40.1206138Z x0 = x0.contiguous() 2025-05-07T20:32:40.1206399Z x1 = x1.contiguous() 2025-05-07T20:32:40.1206648Z 2025-05-07T20:32:40.1206831Z if scale_ub is not None: 2025-05-07T20:32:40.1207268Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.1207618Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.1207927Z ) 2025-05-07T20:32:40.1208112Z else: 2025-05-07T20:32:40.1208315Z scale_ub_tensor = None 2025-05-07T20:32:40.1208561Z 2025-05-07T20:32:40.1208788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.1209115Z op = silu_mul_quant 2025-05-07T20:32:40.1209362Z if compiled: 2025-05-07T20:32:40.1209606Z op = torch.compile(op) 2025-05-07T20:32:40.1209907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1210179Z 2025-05-07T20:32:40.1210364Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.1210528Z 2025-05-07T20:32:40.1210635Z moe/activation_test.py:117: 2025-05-07T20:32:40.1210931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1211268Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.1211558Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1212144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.1212731Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.1213430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.1214234Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.1214795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.1215515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.1216218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.1216846Z kernel = self.compile( 2025-05-07T20:32:40.1217411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.1218100Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.1218510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1218748Z 2025-05-07T20:32:40.1218960Z self = 2025-05-07T20:32:40.1220115Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.1221616Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3d14af0>} 2025-05-07T20:32:40.1223090Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.1224192Z context = 2025-05-07T20:32:40.1224499Z 2025-05-07T20:32:40.1224672Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.1225209Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.1225696Z module_map=module_map) 2025-05-07T20:32:40.1226067Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.1226418Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.1226675Z E ^ 2025-05-07T20:32:40.1227161Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.1227648Z 2025-05-07T20:32:40.1228098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.1228734Z 2025-05-07T20:32:40.1228835Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.1229263Z self=, 2025-05-07T20:32:40.1229678Z T=2048, 2025-05-07T20:32:40.1229950Z D=5120, 2025-05-07T20:32:40.1230135Z scale_ub=1200.0, 2025-05-07T20:32:40.1230354Z contiguous=False, 2025-05-07T20:32:40.1230570Z compiled=True, 2025-05-07T20:32:40.1230793Z ) 2025-05-07T20:32:40.1231113Z self = 2025-05-07T20:32:40.1231630Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:40.1231916Z 2025-05-07T20:32:40.1231992Z @given( 2025-05-07T20:32:40.1232214Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.1232532Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.1232845Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.1233190Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.1233531Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.1233827Z ) 2025-05-07T20:32:40.1234183Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.1234646Z def test_silu_mul_quant( 2025-05-07T20:32:40.1234898Z self, 2025-05-07T20:32:40.1235088Z T: int, 2025-05-07T20:32:40.1235354Z D: int, 2025-05-07T20:32:40.1235565Z scale_ub: Optional[float], 2025-05-07T20:32:40.1235834Z contiguous: bool, 2025-05-07T20:32:40.1236073Z compiled: bool, 2025-05-07T20:32:40.1236292Z ) -> None: 2025-05-07T20:32:40.1236498Z torch.manual_seed(2025) 2025-05-07T20:32:40.1236735Z 2025-05-07T20:32:40.1237049Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.1237398Z 2025-05-07T20:32:40.1237580Z x_sign = torch.sign(x) 2025-05-07T20:32:40.1237873Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.1238189Z x = x_sign * x_clamp 2025-05-07T20:32:40.1238423Z x0 = x[:, :D] 2025-05-07T20:32:40.1238642Z x1 = x[:, D:] 2025-05-07T20:32:40.1238842Z 2025-05-07T20:32:40.1239016Z if contiguous: 2025-05-07T20:32:40.1239242Z x0 = x0.contiguous() 2025-05-07T20:32:40.1239494Z x1 = x1.contiguous() 2025-05-07T20:32:40.1239728Z 2025-05-07T20:32:40.1239915Z if scale_ub is not None: 2025-05-07T20:32:40.1240186Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.1240518Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.1240839Z ) 2025-05-07T20:32:40.1241031Z else: 2025-05-07T20:32:40.1241239Z scale_ub_tensor = None 2025-05-07T20:32:40.1241489Z 2025-05-07T20:32:40.1241714Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.1247864Z op = silu_mul_quant 2025-05-07T20:32:40.1248136Z if compiled: 2025-05-07T20:32:40.1248395Z op = torch.compile(op) 2025-05-07T20:32:40.1248699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1248981Z 2025-05-07T20:32:40.1249177Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.1249345Z 2025-05-07T20:32:40.1249444Z moe/activation_test.py:117: 2025-05-07T20:32:40.1249752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1250101Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.1250384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1250986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.1251585Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.1252298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.1253041Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.1253751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.1254489Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.1255193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.1255765Z kernel = self.compile( 2025-05-07T20:32:40.1256344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.1257045Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.1257456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1257704Z 2025-05-07T20:32:40.1257919Z self = 2025-05-07T20:32:40.1259091Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.1260591Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3adf820>} 2025-05-07T20:32:40.1262057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.1263212Z context = 2025-05-07T20:32:40.1263529Z 2025-05-07T20:32:40.1263701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.1264292Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.1264785Z module_map=module_map) 2025-05-07T20:32:40.1265165Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.1265527Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.1265795Z E ^ 2025-05-07T20:32:40.1266282Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.1266780Z 2025-05-07T20:32:40.1267235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.1267789Z 2025-05-07T20:32:40.3504175Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.3505479Z self=, 2025-05-07T20:32:40.3506246Z T=4096, 2025-05-07T20:32:40.3506465Z D=5120, 2025-05-07T20:32:40.3506651Z scale_ub=1200.0, 2025-05-07T20:32:40.3506871Z contiguous=True, 2025-05-07T20:32:40.3507096Z compiled=True, 2025-05-07T20:32:40.3507302Z ) 2025-05-07T20:32:40.3507642Z self = 2025-05-07T20:32:40.3508179Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:40.3508476Z 2025-05-07T20:32:40.3508556Z @given( 2025-05-07T20:32:40.3508787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.3509113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.3509428Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.3509917Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.3510257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.3510555Z ) 2025-05-07T20:32:40.3510917Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.3511397Z def test_silu_mul_quant( 2025-05-07T20:32:40.3511643Z self, 2025-05-07T20:32:40.3511829Z T: int, 2025-05-07T20:32:40.3512028Z D: int, 2025-05-07T20:32:40.3512437Z scale_ub: Optional[float], 2025-05-07T20:32:40.3512719Z contiguous: bool, 2025-05-07T20:32:40.3512967Z compiled: bool, 2025-05-07T20:32:40.3513195Z ) -> None: 2025-05-07T20:32:40.3513411Z torch.manual_seed(2025) 2025-05-07T20:32:40.3513663Z 2025-05-07T20:32:40.3513945Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.3514309Z 2025-05-07T20:32:40.3514515Z x_sign = torch.sign(x) 2025-05-07T20:32:40.3514812Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.3515128Z x = x_sign * x_clamp 2025-05-07T20:32:40.3515377Z x0 = x[:, :D] 2025-05-07T20:32:40.3515595Z x1 = x[:, D:] 2025-05-07T20:32:40.3515811Z 2025-05-07T20:32:40.3515993Z if contiguous: 2025-05-07T20:32:40.3516235Z x0 = x0.contiguous() 2025-05-07T20:32:40.3516505Z x1 = x1.contiguous() 2025-05-07T20:32:40.3516750Z 2025-05-07T20:32:40.3516953Z if scale_ub is not None: 2025-05-07T20:32:40.3517231Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.3517570Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.3517886Z ) 2025-05-07T20:32:40.3518073Z else: 2025-05-07T20:32:40.3518271Z scale_ub_tensor = None 2025-05-07T20:32:40.3518530Z 2025-05-07T20:32:40.3518753Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.3519203Z op = silu_mul_quant 2025-05-07T20:32:40.3519464Z if compiled: 2025-05-07T20:32:40.3519718Z op = torch.compile(op) 2025-05-07T20:32:40.3520023Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3520315Z 2025-05-07T20:32:40.3520505Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.3520738Z 2025-05-07T20:32:40.3520837Z moe/activation_test.py:117: 2025-05-07T20:32:40.3521128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3521479Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.3521765Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3522355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.3522955Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.3523665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.3524418Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.3524986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.3525724Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.3526444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.3527008Z kernel = self.compile( 2025-05-07T20:32:40.3527578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.3528272Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3528681Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3528923Z 2025-05-07T20:32:40.3529134Z self = 2025-05-07T20:32:40.3530300Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.3531811Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3bf2430>} 2025-05-07T20:32:40.3533363Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.3534465Z context = 2025-05-07T20:32:40.3534766Z 2025-05-07T20:32:40.3534933Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.3535472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3535964Z module_map=module_map) 2025-05-07T20:32:40.3536332Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3536692Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.3536958Z E ^ 2025-05-07T20:32:40.3537437Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3537925Z 2025-05-07T20:32:40.3538377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.3538933Z 2025-05-07T20:32:40.3539031Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.3539455Z self=, 2025-05-07T20:32:40.3539869Z T=128, 2025-05-07T20:32:40.3540050Z D=5120, 2025-05-07T20:32:40.3540236Z scale_ub=1200.0, 2025-05-07T20:32:40.3540496Z contiguous=False, 2025-05-07T20:32:40.3540720Z compiled=True, 2025-05-07T20:32:40.3540918Z ) 2025-05-07T20:32:40.6725801Z self = 2025-05-07T20:32:40.6726422Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:40.6726846Z 2025-05-07T20:32:40.6727145Z @given( 2025-05-07T20:32:40.6727461Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6727904Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6728317Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6728737Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6729169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6729498Z ) 2025-05-07T20:32:40.6729857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6730322Z def test_silu_mul_quant( 2025-05-07T20:32:40.6730567Z self, 2025-05-07T20:32:40.6730761Z T: int, 2025-05-07T20:32:40.6730962Z D: int, 2025-05-07T20:32:40.6731187Z scale_ub: Optional[float], 2025-05-07T20:32:40.6731457Z contiguous: bool, 2025-05-07T20:32:40.6731705Z compiled: bool, 2025-05-07T20:32:40.6731936Z ) -> None: 2025-05-07T20:32:40.6732147Z torch.manual_seed(2025) 2025-05-07T20:32:40.6732398Z 2025-05-07T20:32:40.6732678Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6733039Z 2025-05-07T20:32:40.6733229Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6733528Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6733846Z x = x_sign * x_clamp 2025-05-07T20:32:40.6734086Z x0 = x[:, :D] 2025-05-07T20:32:40.6734303Z x1 = x[:, D:] 2025-05-07T20:32:40.6734516Z 2025-05-07T20:32:40.6734697Z if contiguous: 2025-05-07T20:32:40.6734929Z x0 = x0.contiguous() 2025-05-07T20:32:40.6735195Z x1 = x1.contiguous() 2025-05-07T20:32:40.6735438Z 2025-05-07T20:32:40.6735632Z if scale_ub is not None: 2025-05-07T20:32:40.6735910Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6736289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6736619Z ) 2025-05-07T20:32:40.6736812Z else: 2025-05-07T20:32:40.6737019Z scale_ub_tensor = None 2025-05-07T20:32:40.6737278Z 2025-05-07T20:32:40.6737513Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6737968Z op = silu_mul_quant 2025-05-07T20:32:40.6738222Z if compiled: 2025-05-07T20:32:40.6738469Z op = torch.compile(op) 2025-05-07T20:32:40.6738769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6739043Z 2025-05-07T20:32:40.6739225Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6739389Z 2025-05-07T20:32:40.6739486Z moe/activation_test.py:117: 2025-05-07T20:32:40.6739783Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6740123Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6740409Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6740991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.6741584Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.6742285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6743028Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6743585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6744313Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6745012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6745635Z kernel = self.compile( 2025-05-07T20:32:40.6746197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6746884Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6747295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6747578Z 2025-05-07T20:32:40.6747787Z self = 2025-05-07T20:32:40.6748960Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6750614Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3a36040>} 2025-05-07T20:32:40.6752090Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6753200Z context = 2025-05-07T20:32:40.6753508Z 2025-05-07T20:32:40.6753678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6754225Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6754717Z module_map=module_map) 2025-05-07T20:32:40.6755090Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6755453Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6755721Z E ^ 2025-05-07T20:32:40.6756212Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6756699Z 2025-05-07T20:32:40.6757144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6757707Z 2025-05-07T20:32:40.6757810Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6758239Z self=, 2025-05-07T20:32:40.6758667Z T=16384, 2025-05-07T20:32:40.6758861Z D=7168, 2025-05-07T20:32:40.6759057Z scale_ub=1200.0, 2025-05-07T20:32:40.6759281Z contiguous=True, 2025-05-07T20:32:40.6759585Z compiled=True, 2025-05-07T20:32:40.6759784Z ) 2025-05-07T20:32:40.6760107Z self = 2025-05-07T20:32:40.6760617Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:40.6760916Z 2025-05-07T20:32:40.6760990Z @given( 2025-05-07T20:32:40.6761214Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6761525Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6761832Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6762163Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6762494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6762778Z ) 2025-05-07T20:32:40.6763130Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6763591Z def test_silu_mul_quant( 2025-05-07T20:32:40.6763827Z self, 2025-05-07T20:32:40.6764015Z T: int, 2025-05-07T20:32:40.6764212Z D: int, 2025-05-07T20:32:40.6764419Z scale_ub: Optional[float], 2025-05-07T20:32:40.6764689Z contiguous: bool, 2025-05-07T20:32:40.6764928Z compiled: bool, 2025-05-07T20:32:40.6765142Z ) -> None: 2025-05-07T20:32:40.6765350Z torch.manual_seed(2025) 2025-05-07T20:32:40.6765589Z 2025-05-07T20:32:40.6765855Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6766253Z 2025-05-07T20:32:40.6766442Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6766722Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6767033Z x = x_sign * x_clamp 2025-05-07T20:32:40.6767270Z x0 = x[:, :D] 2025-05-07T20:32:40.6767481Z x1 = x[:, D:] 2025-05-07T20:32:40.6767724Z 2025-05-07T20:32:40.6767904Z if contiguous: 2025-05-07T20:32:40.6768127Z x0 = x0.contiguous() 2025-05-07T20:32:40.6768381Z x1 = x1.contiguous() 2025-05-07T20:32:40.6768624Z 2025-05-07T20:32:40.6768815Z if scale_ub is not None: 2025-05-07T20:32:40.6769083Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6769417Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6769732Z ) 2025-05-07T20:32:40.6769911Z else: 2025-05-07T20:32:40.6770117Z scale_ub_tensor = None 2025-05-07T20:32:40.6770365Z 2025-05-07T20:32:40.6770588Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6770908Z op = silu_mul_quant 2025-05-07T20:32:40.6771156Z if compiled: 2025-05-07T20:32:40.6771401Z op = torch.compile(op) 2025-05-07T20:32:40.6771700Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6771983Z 2025-05-07T20:32:40.6772167Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6772336Z 2025-05-07T20:32:40.6772429Z moe/activation_test.py:117: 2025-05-07T20:32:40.6772728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6773070Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6773346Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6773931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.6774522Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.6775216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6775953Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6776510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6777238Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6777943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6778612Z kernel = self.compile( 2025-05-07T20:32:40.6779180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6779871Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6780272Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6780515Z 2025-05-07T20:32:40.6780727Z self = 2025-05-07T20:32:40.6781889Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6783572Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3a36af0>} 2025-05-07T20:32:40.6785038Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6786193Z context = 2025-05-07T20:32:40.6786500Z 2025-05-07T20:32:40.6786667Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6787280Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6787762Z module_map=module_map) 2025-05-07T20:32:40.6788132Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6788487Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6788806Z E ^ 2025-05-07T20:32:40.6789291Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6789840Z 2025-05-07T20:32:40.6790291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6790844Z 2025-05-07T20:32:40.9551729Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.9553037Z self=, 2025-05-07T20:32:40.9554201Z T=16384, 2025-05-07T20:32:40.9554630Z D=5120, 2025-05-07T20:32:40.9555028Z scale_ub=1200.0, 2025-05-07T20:32:40.9555474Z contiguous=True, 2025-05-07T20:32:40.9555914Z compiled=False, 2025-05-07T20:32:40.9556119Z ) 2025-05-07T20:32:40.9556450Z self = 2025-05-07T20:32:40.9556976Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:40.9557275Z 2025-05-07T20:32:40.9557352Z @given( 2025-05-07T20:32:40.9557587Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.9557908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.9558224Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.9558560Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.9558899Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.9559188Z ) 2025-05-07T20:32:40.9559547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.9560016Z def test_silu_mul_quant( 2025-05-07T20:32:40.9560264Z self, 2025-05-07T20:32:40.9560454Z T: int, 2025-05-07T20:32:40.9560649Z D: int, 2025-05-07T20:32:40.9560867Z scale_ub: Optional[float], 2025-05-07T20:32:40.9561137Z contiguous: bool, 2025-05-07T20:32:40.9561380Z compiled: bool, 2025-05-07T20:32:40.9561611Z ) -> None: 2025-05-07T20:32:40.9561827Z torch.manual_seed(2025) 2025-05-07T20:32:40.9562074Z 2025-05-07T20:32:40.9562348Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.9562701Z 2025-05-07T20:32:40.9563058Z x_sign = torch.sign(x) 2025-05-07T20:32:40.9563360Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.9563673Z x = x_sign * x_clamp 2025-05-07T20:32:40.9563915Z x0 = x[:, :D] 2025-05-07T20:32:40.9564134Z x1 = x[:, D:] 2025-05-07T20:32:40.9564337Z 2025-05-07T20:32:40.9564519Z if contiguous: 2025-05-07T20:32:40.9564752Z x0 = x0.contiguous() 2025-05-07T20:32:40.9565018Z x1 = x1.contiguous() 2025-05-07T20:32:40.9565262Z 2025-05-07T20:32:40.9565451Z if scale_ub is not None: 2025-05-07T20:32:40.9565726Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.9566062Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.9566387Z ) 2025-05-07T20:32:40.9566577Z else: 2025-05-07T20:32:40.9566781Z scale_ub_tensor = None 2025-05-07T20:32:40.9567040Z 2025-05-07T20:32:40.9567280Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.9567602Z op = silu_mul_quant 2025-05-07T20:32:40.9567851Z if compiled: 2025-05-07T20:32:40.9568100Z op = torch.compile(op) 2025-05-07T20:32:40.9568397Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9568677Z 2025-05-07T20:32:40.9568867Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.9569034Z 2025-05-07T20:32:40.9569209Z moe/activation_test.py:117: 2025-05-07T20:32:40.9569508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9569860Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.9570154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9570890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.9571693Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.9572258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.9572981Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.9573692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.9574254Z kernel = self.compile( 2025-05-07T20:32:40.9574818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.9575510Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.9575924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9576169Z 2025-05-07T20:32:40.9576379Z self = 2025-05-07T20:32:40.9577555Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.9579050Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3a04550>} 2025-05-07T20:32:40.9580510Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.9581613Z context = 2025-05-07T20:32:40.9581916Z 2025-05-07T20:32:40.9582085Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.9582631Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.9583309Z module_map=module_map) 2025-05-07T20:32:40.9583805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.9584170Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.9584434Z E ^ 2025-05-07T20:32:40.9584919Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.9585406Z 2025-05-07T20:32:40.9585853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.9592172Z 2025-05-07T20:32:40.9592302Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.9592756Z self=, 2025-05-07T20:32:40.9593187Z T=1, 2025-05-07T20:32:40.9593374Z D=7168, 2025-05-07T20:32:40.9593574Z scale_ub=1200.0, 2025-05-07T20:32:40.9593805Z contiguous=False, 2025-05-07T20:32:40.9594039Z compiled=False, 2025-05-07T20:32:40.9594252Z ) 2025-05-07T20:32:40.9594583Z self = 2025-05-07T20:32:40.9595110Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:40.9595402Z 2025-05-07T20:32:40.9595481Z @given( 2025-05-07T20:32:40.9595714Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.9596035Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.9596350Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.9596794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.9597128Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.9597420Z ) 2025-05-07T20:32:40.9597782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.9598249Z def test_silu_mul_quant( 2025-05-07T20:32:40.9598556Z self, 2025-05-07T20:32:40.9598747Z T: int, 2025-05-07T20:32:40.9598938Z D: int, 2025-05-07T20:32:40.9599148Z scale_ub: Optional[float], 2025-05-07T20:32:40.9599433Z contiguous: bool, 2025-05-07T20:32:40.9599674Z compiled: bool, 2025-05-07T20:32:40.9599895Z ) -> None: 2025-05-07T20:32:40.9600111Z torch.manual_seed(2025) 2025-05-07T20:32:40.9600355Z 2025-05-07T20:32:40.9600625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.9600985Z 2025-05-07T20:32:40.9601172Z x_sign = torch.sign(x) 2025-05-07T20:32:40.9601462Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.9601784Z x = x_sign * x_clamp 2025-05-07T20:32:40.9602024Z x0 = x[:, :D] 2025-05-07T20:32:40.9602234Z x1 = x[:, D:] 2025-05-07T20:32:40.9602439Z 2025-05-07T20:32:40.9602622Z if contiguous: 2025-05-07T20:32:40.9602848Z x0 = x0.contiguous() 2025-05-07T20:32:40.9603108Z x1 = x1.contiguous() 2025-05-07T20:32:40.9603351Z 2025-05-07T20:32:40.9603539Z if scale_ub is not None: 2025-05-07T20:32:40.9603814Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.9604154Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.9604477Z ) 2025-05-07T20:32:40.9604665Z else: 2025-05-07T20:32:40.9604866Z scale_ub_tensor = None 2025-05-07T20:32:40.9605116Z 2025-05-07T20:32:40.9605347Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.9605670Z op = silu_mul_quant 2025-05-07T20:32:40.9605926Z if compiled: 2025-05-07T20:32:40.9606170Z op = torch.compile(op) 2025-05-07T20:32:40.9606466Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9606744Z 2025-05-07T20:32:40.9606928Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.9607095Z 2025-05-07T20:32:40.9607193Z moe/activation_test.py:117: 2025-05-07T20:32:40.9607496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9607844Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.9608213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9608949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.9609687Z 
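Note on the repeated failure above: this is an architecture mismatch, not a flaw in the test logic. The _fbgemm_silu_mul_quant kernel asks Triton for the fp8e4nv (e4m3) dtype, which Triton only lowers on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper); on an SM 8.x part such as the A10G that g5 runners typically carry, only fp8e4b15 and fp8e5 are available, which is exactly what the ValueError reports. A minimal sketch of a capability gate that would skip rather than fail these cases (the class name and placement are assumptions, not taken from the log):

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (the e4m3 variant behind torch.float8_e4m3fn) needs
        # compute capability >= 8.9; fp8e4b15 / fp8e5 are the SM 8.x options.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical guard; apply to the test class that owns test_silu_mul_quant.
    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class SiluMulQuantTests(unittest.TestCase):
        ...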
2025-05-07T20:32:40.9624723Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:41.0830554Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:41.2634174Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
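Since every drawn example hits the same compile error, one pinned case is enough for a local repro; Hypothesis's @example decorator forces a specific draw to run first. A sketch under the strategies shown in the log (the test is flattened to a plain function here; the real one is a method taking self):

    from hypothesis import example, given, settings, strategies as st

    @example(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  # failing draw from this log
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled):
        ...  # same body as printed in the log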
2025-05-07T20:32:41.2668681Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.2682403Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 112.00 MiB; 28.44 MiB free)
2025-05-07T20:32:41.2702413Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92, x = torch.randn([T, 2 * D], ...) (tried to allocate 448.00 MiB; 140.44 MiB free)
2025-05-07T20:32:41.3765589Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 56.00 MiB; 28.44 MiB free)
2025-05-07T20:32:41.3779140Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94, x_sign = torch.sign(x) (tried to allocate 56.00 MiB; 28.44 MiB free)
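The OutOfMemoryError examples above are consistent with the test's own allocations on this ~22 GiB device rather than a kernel leak: x = torch.randn([T, 2 * D], dtype=torch.bfloat16) costs T * 2D * 2 bytes, which for T=16384, D=7168 is exactly the 448.00 MiB that fails at line 92, and each derived tensor costs the same again (56.00 MiB at T=2048, D=7168, matching the failures at lines 94-95); with 21.9+ GiB already held from earlier examples, even small requests fail. Beyond the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint the allocator prints, a common mitigation is to return cached blocks between examples; a sketch (placement in a setup/teardown hook is an assumption):

    import os

    # Must be set before the process first initializes CUDA; in CI it
    # belongs in the job environment rather than the test module.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def _release_cuda_cache() -> None:
        # Hand cached-but-unallocated blocks back to the driver so one large
        # Hypothesis example cannot starve the draws that follow it.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            torch.cuda.empty_cache()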
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.3792121Z 2025-05-07T20:32:41.3792237Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:41.3792453Z 2025-05-07T20:32:41.3792558Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.3792975Z self=, 2025-05-07T20:32:41.3793390Z T=1, 2025-05-07T20:32:41.3793566Z D=7168, 2025-05-07T20:32:41.3793746Z scale_ub=1200.0, 2025-05-07T20:32:41.3794085Z contiguous=True, 2025-05-07T20:32:41.3794302Z compiled=False, 2025-05-07T20:32:41.3794500Z ) 2025-05-07T20:32:41.5358626Z self = 2025-05-07T20:32:41.5359449Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.5359828Z 2025-05-07T20:32:41.5359935Z @given( 2025-05-07T20:32:41.5360248Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.5360572Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.5360877Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.5361216Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.5361556Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.5361850Z ) 2025-05-07T20:32:41.5362209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.5362679Z def test_silu_mul_quant( 2025-05-07T20:32:41.5362937Z self, 2025-05-07T20:32:41.5363129Z T: int, 2025-05-07T20:32:41.5363331Z D: int, 2025-05-07T20:32:41.5363549Z scale_ub: Optional[float], 2025-05-07T20:32:41.5363817Z contiguous: bool, 2025-05-07T20:32:41.5364060Z compiled: bool, 2025-05-07T20:32:41.5364284Z ) -> None: 2025-05-07T20:32:41.5364491Z torch.manual_seed(2025) 2025-05-07T20:32:41.5364748Z 2025-05-07T20:32:41.5365160Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.5365510Z 2025-05-07T20:32:41.5365694Z x_sign = torch.sign(x) 2025-05-07T20:32:41.5365983Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.5366302Z x = x_sign * x_clamp 2025-05-07T20:32:41.5366624Z x0 = x[:, :D] 2025-05-07T20:32:41.5366834Z x1 = x[:, D:] 2025-05-07T20:32:41.5367036Z 2025-05-07T20:32:41.5367216Z if contiguous: 2025-05-07T20:32:41.5367443Z x0 = x0.contiguous() 2025-05-07T20:32:41.5367705Z x1 = x1.contiguous() 2025-05-07T20:32:41.5367937Z 2025-05-07T20:32:41.5368121Z if scale_ub is not None: 2025-05-07T20:32:41.5368392Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.5368727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.5369041Z ) 2025-05-07T20:32:41.5369231Z else: 2025-05-07T20:32:41.5369433Z scale_ub_tensor = None 2025-05-07T20:32:41.5369683Z 2025-05-07T20:32:41.5369908Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.5370222Z op = silu_mul_quant 2025-05-07T20:32:41.5370470Z if compiled: 2025-05-07T20:32:41.5370713Z op = torch.compile(op) 2025-05-07T20:32:41.5371027Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5371305Z 2025-05-07T20:32:41.5371489Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.5371656Z 2025-05-07T20:32:41.5371752Z moe/activation_test.py:117: 2025-05-07T20:32:41.5372051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.5372387Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.5372668Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5373400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.5374138Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.5374701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.5375426Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.5376130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.5376688Z kernel = self.compile( 2025-05-07T20:32:41.5377384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.5378083Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.5378487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.5378726Z 2025-05-07T20:32:41.5378935Z self = 2025-05-07T20:32:41.5380100Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.5381600Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd35eb0d0>} 2025-05-07T20:32:41.5383250Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.5384354Z context = 2025-05-07T20:32:41.5384659Z 2025-05-07T20:32:41.5384828Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.5385374Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.5385933Z module_map=module_map) 2025-05-07T20:32:41.5386299Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.5386656Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.5386915Z E ^ 2025-05-07T20:32:41.5387402Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.5387956Z 2025-05-07T20:32:41.5388401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.5388961Z 2025-05-07T20:32:41.5389059Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.5389481Z self=, 2025-05-07T20:32:41.5390019Z T=128, 2025-05-07T20:32:41.5390201Z D=5120, 2025-05-07T20:32:41.5390385Z scale_ub=None, 2025-05-07T20:32:41.5390590Z contiguous=True, 2025-05-07T20:32:41.5390806Z compiled=False, 2025-05-07T20:32:41.5391005Z ) 2025-05-07T20:32:41.5391316Z self = 2025-05-07T20:32:41.5391829Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:41.5392109Z 2025-05-07T20:32:41.5392185Z @given( 2025-05-07T20:32:41.5392411Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.5392724Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.5393033Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.5393374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.5393706Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.5393995Z ) 2025-05-07T20:32:41.5394351Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.5394806Z def test_silu_mul_quant( 2025-05-07T20:32:41.5395044Z self, 2025-05-07T20:32:41.5395233Z T: int, 2025-05-07T20:32:41.5395427Z D: int, 2025-05-07T20:32:41.5395634Z scale_ub: Optional[float], 2025-05-07T20:32:41.5395904Z contiguous: bool, 2025-05-07T20:32:41.5396139Z compiled: bool, 2025-05-07T20:32:41.5396348Z ) -> None: 2025-05-07T20:32:41.5396556Z torch.manual_seed(2025) 2025-05-07T20:32:41.5396793Z 2025-05-07T20:32:41.5397059Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.5397410Z 2025-05-07T20:32:41.5397596Z x_sign = torch.sign(x) 2025-05-07T20:32:41.5398012Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.5398329Z x = x_sign * x_clamp 2025-05-07T20:32:41.5398572Z x0 = x[:, :D] 2025-05-07T20:32:41.5398781Z x1 = x[:, D:] 2025-05-07T20:32:41.5398981Z 2025-05-07T20:32:41.5399162Z if contiguous: 2025-05-07T20:32:41.5399383Z x0 = x0.contiguous() 2025-05-07T20:32:41.5399639Z x1 = x1.contiguous() 2025-05-07T20:32:41.5399880Z 2025-05-07T20:32:41.5400066Z if scale_ub is not None: 2025-05-07T20:32:41.5400342Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.5400682Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.5401004Z ) 2025-05-07T20:32:41.5401191Z else: 2025-05-07T20:32:41.5401399Z scale_ub_tensor = None 2025-05-07T20:32:41.5401663Z 2025-05-07T20:32:41.5401887Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.5402210Z op = silu_mul_quant 2025-05-07T20:32:41.5402469Z if compiled: 2025-05-07T20:32:41.5402712Z op = torch.compile(op) 2025-05-07T20:32:41.5403015Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5403296Z 2025-05-07T20:32:41.5403481Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.5403650Z 2025-05-07T20:32:41.5403745Z moe/activation_test.py:117: 2025-05-07T20:32:41.5404042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.5404436Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.5404714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5405449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.5406195Z 
_fbgemm_silu_mul_quant[grid](
    (same Triton jit.py / compiler.py frames as in the traceback above, ending in the identical error)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
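This ValueError is what Triton raises when a kernel uses the fp8e4nv (FP8 E4M3) type on a GPU whose compute capability is below 8.9; per the message, only the fp8e4b15 and fp8e5 encodings are lowered on this architecture. A minimal sketch of a capability guard that would skip these cases on such hardware; the names gpu_supports_fp8e4nv and ActivationFp8Tests are illustrative assumptions, not FBGEMM's actual test plumbing:

import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Triton lowers fp8e4nv (e4m3) only on SM 8.9+ (Ada/Hopper); older
    # parts raise the ValueError seen above at kernel-compile time.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class ActivationFp8Tests(unittest.TestCase):
    ...

Skipping at collection time would also avoid the cascading out-of-memory failures recorded below, since no bf16 inputs would be allocated on unsupported runners.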
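For reference while reading the test source above: judging only from the call op(x0, x1, scale_ub_tensor) returning (y_fp8, y_scale), the op plausibly computes silu(x0) * x1 followed by rowwise FP8 quantization with an optional upper bound on the scale. The eager-mode sketch below is an inference from the test code, not FBGEMM's actual kernel:

from typing import Optional, Tuple

import torch
import torch.nn.functional as F

def silu_mul_quant_reference(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # silu(x0) * x1, computed in fp32 for accuracy.
    y = F.silu(x0.float()) * x1.float()
    # Rowwise absmax scale, optionally clamped from above by scale_ub.
    amax = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub.float())
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    scale = amax / fp8_max
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale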
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.5419765Z 2025-05-07T20:32:41.5420214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.5420774Z 2025-05-07T20:32:41.5420875Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.5421306Z self=, 2025-05-07T20:32:41.5421725Z T=128, 2025-05-07T20:32:41.5421906Z D=7168, 2025-05-07T20:32:41.5422094Z scale_ub=None, 2025-05-07T20:32:41.5422306Z contiguous=True, 2025-05-07T20:32:41.5422522Z compiled=False, 2025-05-07T20:32:41.5422726Z ) 2025-05-07T20:32:41.6322897Z self = 2025-05-07T20:32:41.6324409Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:41.6325205Z 2025-05-07T20:32:41.6325402Z @given( 2025-05-07T20:32:41.6325847Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.6326312Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.6326620Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.6326958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.6327292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.6327696Z ) 2025-05-07T20:32:41.6328048Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.6328511Z def test_silu_mul_quant( 2025-05-07T20:32:41.6328749Z self, 2025-05-07T20:32:41.6328932Z T: int, 2025-05-07T20:32:41.6329125Z D: int, 2025-05-07T20:32:41.6329413Z scale_ub: Optional[float], 2025-05-07T20:32:41.6329684Z contiguous: bool, 2025-05-07T20:32:41.6329922Z compiled: bool, 2025-05-07T20:32:41.6330146Z ) -> None: 2025-05-07T20:32:41.6330361Z torch.manual_seed(2025) 2025-05-07T20:32:41.6330609Z 2025-05-07T20:32:41.6330882Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.6331228Z 2025-05-07T20:32:41.6331413Z x_sign = torch.sign(x) 2025-05-07T20:32:41.6331700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.6332009Z x = x_sign * x_clamp 2025-05-07T20:32:41.6332250Z x0 = x[:, :D] 2025-05-07T20:32:41.6332463Z x1 = x[:, D:] 2025-05-07T20:32:41.6332667Z 2025-05-07T20:32:41.6332845Z if contiguous: 2025-05-07T20:32:41.6333072Z x0 = x0.contiguous() 2025-05-07T20:32:41.6333328Z x1 = x1.contiguous() 2025-05-07T20:32:41.6333562Z 2025-05-07T20:32:41.6333748Z if scale_ub is not None: 2025-05-07T20:32:41.6334022Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.6334357Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.6334668Z ) 2025-05-07T20:32:41.6334858Z else: 2025-05-07T20:32:41.6335057Z scale_ub_tensor = None 2025-05-07T20:32:41.6335309Z 2025-05-07T20:32:41.6335536Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.6335848Z op = silu_mul_quant 2025-05-07T20:32:41.6336102Z if compiled: 2025-05-07T20:32:41.6336343Z op = torch.compile(op) 2025-05-07T20:32:41.6336643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.6336918Z 2025-05-07T20:32:41.6337105Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.6337271Z 2025-05-07T20:32:41.6337369Z moe/activation_test.py:117: 2025-05-07T20:32:41.6337662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.6338004Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.6338283Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.6339164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.6339913Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.6340476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.6341203Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.6341901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.6342465Z kernel = self.compile( 2025-05-07T20:32:41.6343033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.6343722Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.6344130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.6344374Z 2025-05-07T20:32:41.6344592Z self = 2025-05-07T20:32:41.6345763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.6347262Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd36f0550>} 2025-05-07T20:32:41.6348771Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.6349987Z context = 2025-05-07T20:32:41.6350337Z 2025-05-07T20:32:41.6350504Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.6351045Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.6351526Z module_map=module_map) 2025-05-07T20:32:41.6351903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.6352261Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.6352515Z E ^ 2025-05-07T20:32:41.6353007Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.6353500Z 2025-05-07T20:32:41.6353949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.6354503Z 2025-05-07T20:32:41.6354606Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.6355027Z self=, 2025-05-07T20:32:41.6355448Z T=2048, 2025-05-07T20:32:41.6355639Z D=7168, 2025-05-07T20:32:41.6361945Z scale_ub=1200.0, 2025-05-07T20:32:41.6362196Z contiguous=True, 2025-05-07T20:32:41.6362425Z compiled=False, 2025-05-07T20:32:41.6362638Z ) 2025-05-07T20:32:41.6362966Z self = 2025-05-07T20:32:41.6363494Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.6363791Z 2025-05-07T20:32:41.6363867Z @given( 2025-05-07T20:32:41.6364099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.6364413Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.6364725Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.6365064Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.6365399Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.6365694Z ) 2025-05-07T20:32:41.6366054Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.6366515Z def test_silu_mul_quant( 2025-05-07T20:32:41.6366867Z self, 2025-05-07T20:32:41.6367063Z T: int, 2025-05-07T20:32:41.6367256Z D: int, 2025-05-07T20:32:41.6367473Z scale_ub: Optional[float], 2025-05-07T20:32:41.6367749Z contiguous: bool, 2025-05-07T20:32:41.6367988Z compiled: bool, 2025-05-07T20:32:41.6368209Z ) -> None: 2025-05-07T20:32:41.6368423Z torch.manual_seed(2025) 2025-05-07T20:32:41.6368672Z 2025-05-07T20:32:41.6368948Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.6371217Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.6373280Z 2025-05-07T20:32:41.6373400Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.6373622Z 2025-05-07T20:32:41.6373723Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.6374146Z self=, 2025-05-07T20:32:41.6374605Z T=1, 2025-05-07T20:32:41.6374785Z D=5120, 2025-05-07T20:32:41.6374973Z scale_ub=1200.0, 2025-05-07T20:32:41.6375191Z contiguous=True, 2025-05-07T20:32:41.6375404Z compiled=False, 2025-05-07T20:32:41.6375603Z ) 2025-05-07T20:32:41.6855287Z self = 2025-05-07T20:32:41.6856650Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.6857032Z 2025-05-07T20:32:41.6857146Z @given( 2025-05-07T20:32:41.6857438Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.6857769Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.6858081Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.6858416Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.6858758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.6859059Z ) 2025-05-07T20:32:41.6859420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.6859884Z def test_silu_mul_quant( 2025-05-07T20:32:41.6860125Z self, 2025-05-07T20:32:41.6860318Z T: int, 2025-05-07T20:32:41.6860514Z D: int, 2025-05-07T20:32:41.6860735Z scale_ub: Optional[float], 2025-05-07T20:32:41.6861014Z contiguous: bool, 2025-05-07T20:32:41.6861256Z compiled: bool, 2025-05-07T20:32:41.6861484Z ) -> None: 2025-05-07T20:32:41.6861702Z torch.manual_seed(2025) 2025-05-07T20:32:41.6861941Z 2025-05-07T20:32:41.6862220Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.6862577Z 2025-05-07T20:32:41.6862768Z x_sign = torch.sign(x) 2025-05-07T20:32:41.6863065Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.6863380Z x = x_sign * x_clamp 2025-05-07T20:32:41.6863621Z x0 = x[:, :D] 2025-05-07T20:32:41.6863841Z x1 = x[:, D:] 2025-05-07T20:32:41.6864049Z 2025-05-07T20:32:41.6864235Z if contiguous: 2025-05-07T20:32:41.6864466Z x0 = x0.contiguous() 2025-05-07T20:32:41.6864723Z x1 = x1.contiguous() 2025-05-07T20:32:41.6864969Z 2025-05-07T20:32:41.6865153Z if scale_ub is not None: 2025-05-07T20:32:41.6865434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.6865778Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.6866098Z ) 2025-05-07T20:32:41.6866297Z else: 2025-05-07T20:32:41.6866508Z scale_ub_tensor = None 2025-05-07T20:32:41.6866898Z 2025-05-07T20:32:41.6867125Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.6867444Z op = silu_mul_quant 2025-05-07T20:32:41.6867686Z if compiled: 2025-05-07T20:32:41.6867926Z op = torch.compile(op) 2025-05-07T20:32:41.6868220Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.6868491Z 2025-05-07T20:32:41.6868674Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.6868845Z 2025-05-07T20:32:41.6868940Z moe/activation_test.py:117: 2025-05-07T20:32:41.6869238Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.6869574Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.6869981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.6870723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.6871456Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.6872022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.6872750Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.6873452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.6874004Z kernel = self.compile( 2025-05-07T20:32:41.6874638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.6875331Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.6875735Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.6876021Z 2025-05-07T20:32:41.6876231Z self = 2025-05-07T20:32:41.6877406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.6878905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd37d5280>} 2025-05-07T20:32:41.6880367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.6881466Z context = 2025-05-07T20:32:41.6881770Z 2025-05-07T20:32:41.6881934Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.6882484Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.6883154Z module_map=module_map) 2025-05-07T20:32:41.6883520Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.6883873Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.6884130Z E ^ 2025-05-07T20:32:41.6884613Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.6885102Z 2025-05-07T20:32:41.6885546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.6886104Z 2025-05-07T20:32:41.6886203Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.6886622Z self=, 2025-05-07T20:32:41.6887038Z T=2048, 2025-05-07T20:32:41.6887226Z D=5120, 2025-05-07T20:32:41.6887410Z scale_ub=None, 2025-05-07T20:32:41.6887616Z contiguous=True, 2025-05-07T20:32:41.6887834Z compiled=False, 2025-05-07T20:32:41.6888028Z ) 2025-05-07T20:32:41.6888462Z self = 2025-05-07T20:32:41.6888977Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:41.6889265Z 2025-05-07T20:32:41.6889339Z @given( 2025-05-07T20:32:41.6889561Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.6889875Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.6890186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.6890524Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.6890852Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.6891149Z ) 2025-05-07T20:32:41.6891507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.6891961Z def test_silu_mul_quant( 2025-05-07T20:32:41.6892195Z self, 2025-05-07T20:32:41.6892382Z T: int, 2025-05-07T20:32:41.6892563Z D: int, 2025-05-07T20:32:41.6892782Z scale_ub: Optional[float], 2025-05-07T20:32:41.6893050Z contiguous: bool, 2025-05-07T20:32:41.6893287Z compiled: bool, 2025-05-07T20:32:41.6893509Z ) -> None: 2025-05-07T20:32:41.6893717Z torch.manual_seed(2025) 2025-05-07T20:32:41.6893961Z 2025-05-07T20:32:41.6894226Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.6894572Z 2025-05-07T20:32:41.6894867Z > x_sign = torch.sign(x) 2025-05-07T20:32:41.6897001Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.6899129Z 2025-05-07T20:32:41.6899244Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:41.6899467Z 2025-05-07T20:32:41.6899564Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.6899985Z self=, 2025-05-07T20:32:41.6900397Z T=16384, 2025-05-07T20:32:41.6900582Z D=5120, 2025-05-07T20:32:41.6900765Z scale_ub=None, 2025-05-07T20:32:41.6900972Z contiguous=True, 2025-05-07T20:32:41.6901182Z compiled=False, 2025-05-07T20:32:41.6901380Z ) 2025-05-07T20:32:41.6901697Z self = 2025-05-07T20:32:41.6902207Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:41.6902504Z 2025-05-07T20:32:41.6902579Z @given( 2025-05-07T20:32:41.6902798Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.6903109Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.6903419Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.6903752Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.6904081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.6904367Z ) 2025-05-07T20:32:41.6904716Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.6905175Z def test_silu_mul_quant( 2025-05-07T20:32:41.6905405Z self, 2025-05-07T20:32:41.6905592Z T: int, 2025-05-07T20:32:41.6905782Z D: int, 2025-05-07T20:32:41.6905987Z scale_ub: Optional[float], 2025-05-07T20:32:41.6906258Z contiguous: bool, 2025-05-07T20:32:41.6906520Z compiled: bool, 2025-05-07T20:32:41.6906763Z ) -> None: 2025-05-07T20:32:41.6906970Z torch.manual_seed(2025) 2025-05-07T20:32:41.6907203Z 2025-05-07T20:32:41.6907466Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.6909878Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.6911949Z 2025-05-07T20:32:41.6912064Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.6912285Z 2025-05-07T20:32:41.6912382Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.6912803Z self=, 2025-05-07T20:32:41.6913217Z T=4096, 2025-05-07T20:32:41.6913402Z D=5120, 2025-05-07T20:32:41.6913585Z scale_ub=None, 2025-05-07T20:32:41.6913788Z contiguous=True, 2025-05-07T20:32:41.6914003Z compiled=False, 2025-05-07T20:32:41.6914200Z ) 2025-05-07T20:32:41.9884362Z self = 2025-05-07T20:32:41.9885214Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:41.9885619Z 2025-05-07T20:32:41.9885729Z @given( 2025-05-07T20:32:41.9886100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9886435Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9886770Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9887101Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9887431Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9887788Z ) 2025-05-07T20:32:41.9888138Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9888597Z def test_silu_mul_quant( 2025-05-07T20:32:41.9888838Z self, 2025-05-07T20:32:41.9889025Z T: int, 2025-05-07T20:32:41.9889218Z D: int, 2025-05-07T20:32:41.9889424Z scale_ub: Optional[float], 2025-05-07T20:32:41.9889698Z contiguous: bool, 2025-05-07T20:32:41.9889933Z compiled: bool, 2025-05-07T20:32:41.9890151Z ) -> None: 2025-05-07T20:32:41.9890362Z torch.manual_seed(2025) 2025-05-07T20:32:41.9890606Z 2025-05-07T20:32:41.9890870Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9893114Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9895173Z 2025-05-07T20:32:41.9895288Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9895507Z 2025-05-07T20:32:41.9895606Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9896026Z self=, 2025-05-07T20:32:41.9896441Z T=2048, 2025-05-07T20:32:41.9896619Z D=5120, 2025-05-07T20:32:41.9896801Z scale_ub=None, 2025-05-07T20:32:41.9897009Z contiguous=False, 2025-05-07T20:32:41.9897232Z compiled=False, 2025-05-07T20:32:41.9897427Z ) 2025-05-07T20:32:41.9897741Z self = 2025-05-07T20:32:41.9898255Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:41.9898544Z 2025-05-07T20:32:41.9898617Z @given( 2025-05-07T20:32:41.9898957Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9899269Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9899576Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9899911Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9900236Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9900523Z ) 2025-05-07T20:32:41.9900882Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9901340Z def test_silu_mul_quant( 2025-05-07T20:32:41.9901577Z self, 2025-05-07T20:32:41.9901764Z T: int, 2025-05-07T20:32:41.9901955Z D: int, 2025-05-07T20:32:41.9902160Z scale_ub: Optional[float], 2025-05-07T20:32:41.9902430Z contiguous: bool, 2025-05-07T20:32:41.9902667Z compiled: bool, 2025-05-07T20:32:41.9902879Z ) -> None: 2025-05-07T20:32:41.9903091Z torch.manual_seed(2025) 2025-05-07T20:32:41.9903330Z 2025-05-07T20:32:41.9903601Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9905828Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9908003Z 2025-05-07T20:32:41.9908121Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9908384Z 2025-05-07T20:32:41.9908482Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9908904Z self=, 2025-05-07T20:32:41.9909316Z T=4096, 2025-05-07T20:32:41.9909492Z D=7168, 2025-05-07T20:32:41.9909676Z scale_ub=None, 2025-05-07T20:32:41.9910013Z contiguous=True, 2025-05-07T20:32:41.9910232Z compiled=True, 2025-05-07T20:32:41.9910429Z ) 2025-05-07T20:32:41.9910742Z self = 2025-05-07T20:32:41.9911252Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:41.9911535Z 2025-05-07T20:32:41.9911612Z @given( 2025-05-07T20:32:41.9911826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9912141Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9912447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9912781Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9913109Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9913395Z ) 2025-05-07T20:32:41.9913752Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9914210Z def test_silu_mul_quant( 2025-05-07T20:32:41.9914451Z self, 2025-05-07T20:32:41.9914636Z T: int, 2025-05-07T20:32:41.9914821Z D: int, 2025-05-07T20:32:41.9915032Z scale_ub: Optional[float], 2025-05-07T20:32:41.9915300Z contiguous: bool, 2025-05-07T20:32:41.9915530Z compiled: bool, 2025-05-07T20:32:41.9915745Z ) -> None: 2025-05-07T20:32:41.9915958Z torch.manual_seed(2025) 2025-05-07T20:32:41.9916192Z 2025-05-07T20:32:41.9916459Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9918767Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9920826Z 2025-05-07T20:32:41.9920941Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9921160Z 2025-05-07T20:32:41.9921263Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9921679Z self=, 2025-05-07T20:32:41.9922093Z T=2048, 2025-05-07T20:32:41.9922272Z D=5120, 2025-05-07T20:32:41.9922454Z scale_ub=1200.0, 2025-05-07T20:32:41.9922673Z contiguous=False, 2025-05-07T20:32:41.9922892Z compiled=False, 2025-05-07T20:32:41.9923085Z ) 2025-05-07T20:32:41.9923408Z self = 2025-05-07T20:32:41.9923921Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:41.9924211Z 2025-05-07T20:32:41.9924292Z @given( 2025-05-07T20:32:41.9924511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9924825Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9925134Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9925463Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9925799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9926134Z ) 2025-05-07T20:32:41.9926483Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9926938Z def test_silu_mul_quant( 2025-05-07T20:32:41.9927175Z self, 2025-05-07T20:32:41.9927365Z T: int, 2025-05-07T20:32:41.9927559Z D: int, 2025-05-07T20:32:41.9927811Z scale_ub: Optional[float], 2025-05-07T20:32:41.9928077Z contiguous: bool, 2025-05-07T20:32:41.9928305Z compiled: bool, 2025-05-07T20:32:41.9928520Z ) -> None: 2025-05-07T20:32:41.9928734Z torch.manual_seed(2025) 2025-05-07T20:32:41.9928971Z 2025-05-07T20:32:41.9929234Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9931451Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9933503Z 2025-05-07T20:32:41.9933620Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9933834Z 2025-05-07T20:32:41.9933934Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9934352Z self=, 2025-05-07T20:32:41.9934766Z T=4096, 2025-05-07T20:32:41.9934951Z D=7168, 2025-05-07T20:32:41.9935134Z scale_ub=1200.0, 2025-05-07T20:32:41.9935347Z contiguous=True, 2025-05-07T20:32:41.9935560Z compiled=False, 2025-05-07T20:32:41.9935755Z ) 2025-05-07T20:32:41.9936075Z self = 2025-05-07T20:32:41.9936633Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.9936935Z 2025-05-07T20:32:41.9937009Z @given( 2025-05-07T20:32:41.9937229Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9937543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9937851Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9938179Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9938511Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9938797Z ) 2025-05-07T20:32:41.9939230Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9939688Z def test_silu_mul_quant( 2025-05-07T20:32:41.9939926Z self, 2025-05-07T20:32:41.9940108Z T: int, 2025-05-07T20:32:41.9940295Z D: int, 2025-05-07T20:32:41.9940504Z scale_ub: Optional[float], 2025-05-07T20:32:41.9940768Z contiguous: bool, 2025-05-07T20:32:41.9941007Z compiled: bool, 2025-05-07T20:32:41.9941221Z ) -> None: 2025-05-07T20:32:41.9941421Z torch.manual_seed(2025) 2025-05-07T20:32:41.9941664Z 2025-05-07T20:32:41.9941929Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9944161Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9946219Z 2025-05-07T20:32:41.9946338Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9946598Z 2025-05-07T20:32:41.9946696Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9947116Z self=, 2025-05-07T20:32:41.9947532Z T=16384, 2025-05-07T20:32:41.9947716Z D=7168, 2025-05-07T20:32:41.9947898Z scale_ub=None, 2025-05-07T20:32:41.9948103Z contiguous=False, 2025-05-07T20:32:41.9948360Z compiled=True, 2025-05-07T20:32:41.9948558Z ) 2025-05-07T20:32:42.1256846Z self = 2025-05-07T20:32:42.1257592Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.1258044Z 2025-05-07T20:32:42.1258153Z @given( 2025-05-07T20:32:42.1258461Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1258851Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1259166Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1259508Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1259841Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1260137Z ) 2025-05-07T20:32:42.1260503Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1260963Z def test_silu_mul_quant( 2025-05-07T20:32:42.1261209Z self, 2025-05-07T20:32:42.1261405Z T: int, 2025-05-07T20:32:42.1261611Z D: int, 2025-05-07T20:32:42.1261824Z scale_ub: Optional[float], 2025-05-07T20:32:42.1262094Z contiguous: bool, 2025-05-07T20:32:42.1262336Z compiled: bool, 2025-05-07T20:32:42.1262563Z ) -> None: 2025-05-07T20:32:42.1262780Z torch.manual_seed(2025) 2025-05-07T20:32:42.1263020Z 2025-05-07T20:32:42.1263298Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1265546Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1267661Z 2025-05-07T20:32:42.1267781Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.1268002Z 2025-05-07T20:32:42.1268276Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1268705Z self=, 2025-05-07T20:32:42.1269125Z T=4096, 2025-05-07T20:32:42.1269313Z D=7168, 2025-05-07T20:32:42.1275556Z scale_ub=None, 2025-05-07T20:32:42.1275819Z contiguous=True, 2025-05-07T20:32:42.1276052Z compiled=False, 2025-05-07T20:32:42.1276266Z ) 2025-05-07T20:32:42.1276606Z self = 2025-05-07T20:32:42.1277145Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.1277438Z 2025-05-07T20:32:42.1277516Z @given( 2025-05-07T20:32:42.1277750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1278074Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1278389Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1278730Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1279078Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1279373Z ) 2025-05-07T20:32:42.1279733Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1280200Z def test_silu_mul_quant( 2025-05-07T20:32:42.1280440Z self, 2025-05-07T20:32:42.1280631Z T: int, 2025-05-07T20:32:42.1280825Z D: int, 2025-05-07T20:32:42.1281041Z scale_ub: Optional[float], 2025-05-07T20:32:42.1281420Z contiguous: bool, 2025-05-07T20:32:42.1281662Z compiled: bool, 2025-05-07T20:32:42.1281892Z ) -> None: 2025-05-07T20:32:42.1282105Z torch.manual_seed(2025) 2025-05-07T20:32:42.1282357Z 2025-05-07T20:32:42.1282644Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1285258Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1287319Z 2025-05-07T20:32:42.1287452Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.1287677Z 2025-05-07T20:32:42.1287783Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1288214Z self=, 2025-05-07T20:32:42.1288646Z T=16384, 2025-05-07T20:32:42.1288843Z D=7168, 2025-05-07T20:32:42.1289053Z scale_ub=None, 2025-05-07T20:32:42.1289276Z contiguous=True, 2025-05-07T20:32:42.1289506Z compiled=False, 2025-05-07T20:32:42.1289715Z ) 2025-05-07T20:32:42.1290043Z self = 2025-05-07T20:32:42.1290575Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.1290869Z 2025-05-07T20:32:42.1290964Z @given( 2025-05-07T20:32:42.1291199Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1291526Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1291850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1292192Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1292534Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1292836Z ) 2025-05-07T20:32:42.1293203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1293670Z def test_silu_mul_quant( 2025-05-07T20:32:42.1293923Z self, 2025-05-07T20:32:42.1294120Z T: int, 2025-05-07T20:32:42.1294312Z D: int, 2025-05-07T20:32:42.1294531Z scale_ub: Optional[float], 2025-05-07T20:32:42.1294974Z contiguous: bool, 2025-05-07T20:32:42.1295212Z compiled: bool, 2025-05-07T20:32:42.1295440Z ) -> None: 2025-05-07T20:32:42.1295653Z torch.manual_seed(2025) 2025-05-07T20:32:42.1295889Z 2025-05-07T20:32:42.1296162Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1298399Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1300468Z 2025-05-07T20:32:42.1300584Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.1300807Z 2025-05-07T20:32:42.1300911Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1301330Z self=, 2025-05-07T20:32:42.1301751Z T=16384, 2025-05-07T20:32:42.1301940Z D=7168, 2025-05-07T20:32:42.1302123Z scale_ub=1200.0, 2025-05-07T20:32:42.1302341Z contiguous=True, 2025-05-07T20:32:42.1302566Z compiled=False, 2025-05-07T20:32:42.1302823Z ) 2025-05-07T20:32:42.1303144Z self = 2025-05-07T20:32:42.1303661Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.1303957Z 2025-05-07T20:32:42.1304038Z @given( 2025-05-07T20:32:42.1304258Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1304643Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1304958Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1305299Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1305640Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1305930Z ) 2025-05-07T20:32:42.1306283Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1306746Z def test_silu_mul_quant( 2025-05-07T20:32:42.1306986Z self, 2025-05-07T20:32:42.1307173Z T: int, 2025-05-07T20:32:42.1307370Z D: int, 2025-05-07T20:32:42.1307586Z scale_ub: Optional[float], 2025-05-07T20:32:42.1307852Z contiguous: bool, 2025-05-07T20:32:42.1308091Z compiled: bool, 2025-05-07T20:32:42.1308311Z ) -> None: 2025-05-07T20:32:42.1308527Z torch.manual_seed(2025) 2025-05-07T20:32:42.1308766Z 2025-05-07T20:32:42.1309041Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1311396Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1313460Z 2025-05-07T20:32:42.1313579Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.1313796Z 2025-05-07T20:32:42.1313900Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1314323Z self=, 2025-05-07T20:32:42.1314746Z T=128, 2025-05-07T20:32:42.1314924Z D=5120, 2025-05-07T20:32:42.1315104Z scale_ub=1200.0, 2025-05-07T20:32:42.1315326Z contiguous=False, 2025-05-07T20:32:42.1315549Z compiled=False, 2025-05-07T20:32:42.1315743Z ) 2025-05-07T20:32:42.2946061Z self = 2025-05-07T20:32:42.2947034Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.2947384Z 2025-05-07T20:32:42.2947471Z @given( 2025-05-07T20:32:42.2947698Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2948023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2948339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2948682Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2949017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2949311Z ) 2025-05-07T20:32:42.2949669Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2950233Z def test_silu_mul_quant( 2025-05-07T20:32:42.2950476Z self, 2025-05-07T20:32:42.2950673Z T: int, 2025-05-07T20:32:42.2950867Z D: int, 2025-05-07T20:32:42.2951092Z scale_ub: Optional[float], 2025-05-07T20:32:42.2951368Z contiguous: bool, 2025-05-07T20:32:42.2951603Z compiled: bool, 2025-05-07T20:32:42.2951831Z ) -> None: 2025-05-07T20:32:42.2952051Z torch.manual_seed(2025) 2025-05-07T20:32:42.2952293Z 2025-05-07T20:32:42.2952566Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2952996Z 2025-05-07T20:32:42.2953179Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2953474Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2953789Z x = x_sign * x_clamp 2025-05-07T20:32:42.2954025Z x0 = x[:, :D] 2025-05-07T20:32:42.2954230Z x1 = x[:, D:] 2025-05-07T20:32:42.2954431Z 2025-05-07T20:32:42.2954685Z if contiguous: 2025-05-07T20:32:42.2954906Z x0 = x0.contiguous() 2025-05-07T20:32:42.2955164Z x1 = x1.contiguous() 2025-05-07T20:32:42.2955400Z 2025-05-07T20:32:42.2955583Z if scale_ub is not None: 2025-05-07T20:32:42.2955860Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.2956203Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.2956519Z ) 2025-05-07T20:32:42.2956708Z else: 2025-05-07T20:32:42.2956920Z scale_ub_tensor = None 2025-05-07T20:32:42.2957166Z 2025-05-07T20:32:42.2957394Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.2957720Z op = silu_mul_quant 2025-05-07T20:32:42.2957961Z if compiled: 2025-05-07T20:32:42.2958207Z op = torch.compile(op) 2025-05-07T20:32:42.2958503Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2958782Z 2025-05-07T20:32:42.2958962Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.2959134Z 2025-05-07T20:32:42.2959228Z moe/activation_test.py:117: 2025-05-07T20:32:42.2959521Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2959862Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.2960142Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2960881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.2961623Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.2962188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.2962916Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.2963622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.2964181Z kernel = self.compile( 2025-05-07T20:32:42.2964750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.2965531Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.2965941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2966179Z 2025-05-07T20:32:42.2966389Z self = 2025-05-07T20:32:42.2967558Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.2969062Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd34a6940>} 2025-05-07T20:32:42.2970529Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.2971636Z context = 2025-05-07T20:32:42.2971941Z 2025-05-07T20:32:42.2972105Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.2972653Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.2973139Z module_map=module_map) 2025-05-07T20:32:42.2973508Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.2973913Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.2974175Z E ^ 2025-05-07T20:32:42.2974654Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.2975145Z 2025-05-07T20:32:42.2975589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2976184Z 2025-05-07T20:32:42.2976285Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2976712Z self=, 2025-05-07T20:32:42.2977131Z T=2048, 2025-05-07T20:32:42.2977312Z D=7168, 2025-05-07T20:32:42.2977505Z scale_ub=None, 2025-05-07T20:32:42.2977715Z contiguous=False, 2025-05-07T20:32:42.2977940Z compiled=False, 2025-05-07T20:32:42.2978141Z ) 2025-05-07T20:32:42.2978455Z self = 2025-05-07T20:32:42.2978968Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.2979261Z 2025-05-07T20:32:42.2979333Z @given( 2025-05-07T20:32:42.2979553Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2979873Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2980185Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2980515Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2980848Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2981138Z ) 2025-05-07T20:32:42.2981491Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2981947Z def test_silu_mul_quant( 2025-05-07T20:32:42.2982182Z self, 2025-05-07T20:32:42.2982368Z T: int, 2025-05-07T20:32:42.2982552Z D: int, 2025-05-07T20:32:42.2982939Z scale_ub: Optional[float], 2025-05-07T20:32:42.2983212Z contiguous: bool, 2025-05-07T20:32:42.2983447Z compiled: bool, 2025-05-07T20:32:42.2983667Z ) -> None: 2025-05-07T20:32:42.2983884Z torch.manual_seed(2025) 2025-05-07T20:32:42.2984123Z 2025-05-07T20:32:42.2984398Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2986762Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.2988809Z 2025-05-07T20:32:42.2988927Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.2989144Z 2025-05-07T20:32:42.2989248Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2989664Z self=, 2025-05-07T20:32:42.2990187Z T=128, 2025-05-07T20:32:42.2990363Z D=7168, 2025-05-07T20:32:42.2990541Z scale_ub=1200.0, 2025-05-07T20:32:42.2990759Z contiguous=True, 2025-05-07T20:32:42.2990974Z compiled=True, 2025-05-07T20:32:42.2991166Z ) 2025-05-07T20:32:42.3449958Z self = 2025-05-07T20:32:42.3450750Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.3451143Z 2025-05-07T20:32:42.3451234Z @given( 2025-05-07T20:32:42.3451464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3451781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3452093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3452433Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3452883Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3453176Z ) 2025-05-07T20:32:42.3453534Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3453997Z def test_silu_mul_quant( 2025-05-07T20:32:42.3454235Z self, 2025-05-07T20:32:42.3454490Z T: int, 2025-05-07T20:32:42.3454686Z D: int, 2025-05-07T20:32:42.3454898Z scale_ub: Optional[float], 2025-05-07T20:32:42.3455174Z contiguous: bool, 2025-05-07T20:32:42.3455419Z compiled: bool, 2025-05-07T20:32:42.3455636Z ) -> None: 2025-05-07T20:32:42.3455850Z torch.manual_seed(2025) 2025-05-07T20:32:42.3456095Z 2025-05-07T20:32:42.3456366Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3456722Z 2025-05-07T20:32:42.3456914Z x_sign = torch.sign(x) 2025-05-07T20:32:42.3457205Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.3457531Z x = x_sign * x_clamp 2025-05-07T20:32:42.3457772Z x0 = x[:, :D] 2025-05-07T20:32:42.3457990Z x1 = x[:, D:] 2025-05-07T20:32:42.3458192Z 2025-05-07T20:32:42.3458374Z if contiguous: 2025-05-07T20:32:42.3458603Z x0 = x0.contiguous() 2025-05-07T20:32:42.3458864Z x1 = x1.contiguous() 2025-05-07T20:32:42.3459107Z 2025-05-07T20:32:42.3459296Z if scale_ub is not None: 2025-05-07T20:32:42.3459569Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.3459913Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.3460228Z ) 2025-05-07T20:32:42.3460412Z else: 2025-05-07T20:32:42.3460616Z scale_ub_tensor = None 2025-05-07T20:32:42.3460864Z 2025-05-07T20:32:42.3461086Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.3461403Z op = silu_mul_quant 2025-05-07T20:32:42.3461650Z if compiled: 2025-05-07T20:32:42.3461892Z op = torch.compile(op) 2025-05-07T20:32:42.3462189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.3462464Z 2025-05-07T20:32:42.3462653Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.3462821Z 2025-05-07T20:32:42.3462918Z moe/activation_test.py:117: 2025-05-07T20:32:42.3463218Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.3463559Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.3463833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.3464569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.3465171Z return fn(*args, **kwargs) 2025-05-07T20:32:42.3465865Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.3466604Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.3467169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.3467892Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.3468591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.3469153Z kernel = self.compile( 2025-05-07T20:32:42.3469851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.3470551Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.3470956Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.3471197Z 2025-05-07T20:32:42.3471409Z self = 2025-05-07T20:32:42.3472575Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.3474120Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3498940>} 2025-05-07T20:32:42.3475616Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.3476721Z context = 2025-05-07T20:32:42.3477026Z 2025-05-07T20:32:42.3477193Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.3477731Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.3478213Z module_map=module_map) 2025-05-07T20:32:42.3478584Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.3478939Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.3479192Z E ^ 2025-05-07T20:32:42.3479678Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.3480171Z 2025-05-07T20:32:42.3480618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.3481170Z 2025-05-07T20:32:42.3481277Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3481697Z self=, 2025-05-07T20:32:42.3482114Z T=128, 2025-05-07T20:32:42.3482298Z D=7168, 2025-05-07T20:32:42.3482484Z scale_ub=1200.0, 2025-05-07T20:32:42.3482696Z contiguous=True, 2025-05-07T20:32:42.3483079Z compiled=False, 2025-05-07T20:32:42.3483278Z ) 2025-05-07T20:32:42.3483595Z self = 2025-05-07T20:32:42.3484108Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.3484392Z 2025-05-07T20:32:42.3484471Z @given( 2025-05-07T20:32:42.3484690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3485008Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3485318Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3485646Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3486106Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3486395Z ) 2025-05-07T20:32:42.3486748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3487212Z def test_silu_mul_quant( 2025-05-07T20:32:42.3487446Z self, 2025-05-07T20:32:42.3487633Z T: int, 2025-05-07T20:32:42.3487820Z D: int, 2025-05-07T20:32:42.3488035Z scale_ub: Optional[float], 2025-05-07T20:32:42.3488305Z contiguous: bool, 2025-05-07T20:32:42.3488533Z compiled: bool, 2025-05-07T20:32:42.3488757Z ) -> None: 2025-05-07T20:32:42.3488973Z torch.manual_seed(2025) 2025-05-07T20:32:42.3489208Z 2025-05-07T20:32:42.3489480Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3489842Z 2025-05-07T20:32:42.3490030Z x_sign = torch.sign(x) 2025-05-07T20:32:42.3490320Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.3492507Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3494620Z 2025-05-07T20:32:42.3494732Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.3494950Z 2025-05-07T20:32:42.3495057Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3495536Z self=, 2025-05-07T20:32:42.3495950Z T=128, 2025-05-07T20:32:42.3496130Z D=5120, 2025-05-07T20:32:42.3496327Z scale_ub=1200.0, 2025-05-07T20:32:42.3496582Z contiguous=True, 2025-05-07T20:32:42.3496797Z compiled=True, 2025-05-07T20:32:42.3496988Z ) 2025-05-07T20:32:42.3497307Z self = 2025-05-07T20:32:42.3497815Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.3498096Z 2025-05-07T20:32:42.3498173Z @given( 2025-05-07T20:32:42.3498391Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3498715Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3499023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3499352Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3499684Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3499980Z ) 2025-05-07T20:32:42.3500328Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3500787Z def test_silu_mul_quant( 2025-05-07T20:32:42.3501023Z self, 2025-05-07T20:32:42.3501210Z T: int, 2025-05-07T20:32:42.3501401Z D: int, 2025-05-07T20:32:42.3501616Z scale_ub: Optional[float], 2025-05-07T20:32:42.3501882Z contiguous: bool, 2025-05-07T20:32:42.3502118Z compiled: bool, 2025-05-07T20:32:42.3502332Z ) -> None: 2025-05-07T20:32:42.3502542Z torch.manual_seed(2025) 2025-05-07T20:32:42.3502781Z 2025-05-07T20:32:42.3503055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3503407Z 2025-05-07T20:32:42.3503592Z x_sign = torch.sign(x) 2025-05-07T20:32:42.3503882Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.3506145Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3508195Z 2025-05-07T20:32:42.3508314Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.3508530Z 2025-05-07T20:32:42.3508643Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3509056Z self=, 2025-05-07T20:32:42.3509469Z T=128, 2025-05-07T20:32:42.3509652Z D=7168, 2025-05-07T20:32:42.3509913Z scale_ub=None, 2025-05-07T20:32:42.3510117Z contiguous=True, 2025-05-07T20:32:42.3510334Z compiled=True, 2025-05-07T20:32:42.3510530Z ) 2025-05-07T20:32:42.5923852Z self = 2025-05-07T20:32:42.5924613Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.5925008Z 2025-05-07T20:32:42.5925112Z @given( 2025-05-07T20:32:42.5925359Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5925683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5931485Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5931839Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5932307Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5932595Z ) 2025-05-07T20:32:42.5932958Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5933432Z def test_silu_mul_quant( 2025-05-07T20:32:42.5933675Z self, 2025-05-07T20:32:42.5933875Z T: int, 2025-05-07T20:32:42.5934135Z D: int, 2025-05-07T20:32:42.5934350Z scale_ub: Optional[float], 2025-05-07T20:32:42.5934625Z contiguous: bool, 2025-05-07T20:32:42.5934870Z compiled: bool, 2025-05-07T20:32:42.5935109Z ) -> None: 2025-05-07T20:32:42.5935325Z torch.manual_seed(2025) 2025-05-07T20:32:42.5935575Z 2025-05-07T20:32:42.5935854Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5938096Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
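Note that the failing requests above are only 20-40 MiB while 21.77 GiB is already held by PyTorch, so the pressure is cumulative across Hypothesis examples rather than from any single allocation. The largest sampled shape is nevertheless substantial on its own; the arithmetic below (plain back-of-envelope math, not taken from the job output) works out its footprint:

    # Rough per-example footprint at the largest sampled shape.
    T, D = 16384, 7168
    bf16, fp32 = 2, 4  # bytes per element

    x_bytes = T * 2 * D * bf16      # randn input [T, 2*D] -> 448 MiB
    temp_bytes = 3 * x_bytes        # x_sign, x_clamp, x_sign * x_clamp
    ref_bytes = 3 * T * D * fp32    # x0_fp32, x1_fp32, and the silu-mul result

    print(f"x alone:        {x_bytes / 2**20:,.0f} MiB")                 # 448 MiB
    print(f"with temps:     {(x_bytes + temp_bytes) / 2**20:,.0f} MiB")  # 1,792 MiB
    print(f"reference adds: {ref_bytes / 2**20:,.0f} MiB")               # 1,344 MiB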
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5940163Z 2025-05-07T20:32:42.5940284Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.5940511Z 2025-05-07T20:32:42.5967572Z FAILED 2025-05-07T20:32:42.5967929Z 2025-05-07T20:32:42.5968278Z =================================== FAILURES =================================== 2025-05-07T20:32:42.5968958Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:42.5969668Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:42.5970547Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:42.5971335Z | yield 2025-05-07T20:32:42.5971931Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:32:42.5972671Z | self._callTestMethod(testMethod) 2025-05-07T20:32:42.5973456Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:32:42.5974216Z | method() 2025-05-07T20:32:42.5975331Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:42.5976390Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5977296Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:42.5978225Z | raise the_error_hypothesis_found 2025-05-07T20:32:42.5978945Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:42.5979650Z +-+---------------- 1 ---------------- 2025-05-07T20:32:42.5980044Z | Traceback (most recent call last): 2025-05-07T20:32:42.5981088Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.5982205Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5985516Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5988549Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.5989166Z | self=, 2025-05-07T20:32:42.5989929Z | T=2048, 2025-05-07T20:32:42.5990248Z | D=5120, # or any other generated value 2025-05-07T20:32:42.5990801Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:42.5991613Z | contiguous=True, # or any other generated value 2025-05-07T20:32:42.5992132Z | compiled=False, # or any other generated value 2025-05-07T20:32:42.5992720Z | ) 2025-05-07T20:32:42.5992952Z | 2025-05-07T20:32:42.5993698Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:42.5994566Z +---------------- 2 ---------------- 2025-05-07T20:32:42.5994966Z | Traceback (most recent call last): 2025-05-07T20:32:42.5996002Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.5996939Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5999752Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6002640Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.6003248Z | self=, 2025-05-07T20:32:42.6003830Z | T=128, 2025-05-07T20:32:42.6004104Z | D=7168, 2025-05-07T20:32:42.6004374Z | scale_ub=None, 2025-05-07T20:32:42.6004698Z | contiguous=True, 2025-05-07T20:32:42.6005028Z | compiled=True, 2025-05-07T20:32:42.6005334Z | ) 2025-05-07T20:32:42.6005534Z | 2025-05-07T20:32:42.6006089Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.6006891Z +---------------- 3 ---------------- 2025-05-07T20:32:42.6007181Z | Traceback (most recent call last): 2025-05-07T20:32:42.6007941Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.6008770Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6011001Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
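Because Hypothesis replays many examples inside a single unittest method call, a per-test setUp cannot release memory between examples; any reclaim has to happen in the test body itself. A sketch of that mitigation (an assumption about what this suite could tolerate, not code from the repo; empty_cache() returns only cached, unreferenced blocks and cannot reclaim live tensors):

    import gc
    import torch

    def reclaim_cuda_memory() -> None:
        # Drop dangling Python references first, then hand cached CUDA blocks
        # back to the driver so the next example starts from a cleaner state.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    # Would be called at the top of test_silu_mul_quant's body, once per
    # generated example.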
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6013735Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.6014377Z | self=, 2025-05-07T20:32:42.6014971Z | T=128, 2025-05-07T20:32:42.6015265Z | D=5120, 2025-05-07T20:32:42.6015566Z | scale_ub=1200.0, 2025-05-07T20:32:42.6015918Z | contiguous=True, 2025-05-07T20:32:42.6016273Z | compiled=True, 2025-05-07T20:32:42.6016745Z | ) 2025-05-07T20:32:42.6017014Z | 2025-05-07T20:32:42.6017750Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.6018410Z +---------------- 4 ---------------- 2025-05-07T20:32:42.6018707Z | Traceback (most recent call last): 2025-05-07T20:32:42.6019567Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:42.6020355Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.6021059Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:42.6021811Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.6022719Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:42.6023728Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.6024602Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:42.6025647Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6026785Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:42.6027941Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6029136Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:42.6030562Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6031745Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:42.6032772Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.6033736Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:42.6034575Z | fn() 2025-05-07T20:32:42.6035563Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:42.6036537Z | self.fn.run( 2025-05-07T20:32:42.6037346Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:42.6038204Z | kernel = self.compile( 2025-05-07T20:32:42.6039092Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:42.6040148Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6041194Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:42.6042364Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6043112Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6043607Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.6043994Z | ^ 2025-05-07T20:32:42.6044676Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6045520Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.6046091Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:42.6046891Z | self=, 2025-05-07T20:32:42.6047589Z | T=1, # or any other generated value 2025-05-07T20:32:42.6048030Z | D=5120, # or any other generated value 2025-05-07T20:32:42.6048504Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:42.6049019Z | contiguous=True, # or any other generated value 2025-05-07T20:32:42.6049599Z | compiled=True, # or any other generated value 2025-05-07T20:32:42.6050025Z | ) 2025-05-07T20:32:42.6050276Z | 2025-05-07T20:32:42.6051021Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.6051912Z +------------------------------------ 2025-05-07T20:32:42.6052425Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:42.6052958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6053543Z self=, 2025-05-07T20:32:42.6054135Z T=1, 2025-05-07T20:32:42.6054398Z D=5120, 2025-05-07T20:32:42.6054666Z scale_ub=None, 2025-05-07T20:32:42.6054974Z contiguous=True, 2025-05-07T20:32:42.6055283Z compiled=True, 2025-05-07T20:32:42.6055587Z ) 2025-05-07T20:32:42.6056042Z self = 2025-05-07T20:32:42.6056791Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.6057170Z 2025-05-07T20:32:42.6057276Z @given( 2025-05-07T20:32:42.6057600Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6058048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6058467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6058926Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6059395Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6059807Z ) 2025-05-07T20:32:42.6060318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6060978Z def test_silu_mul_quant( 2025-05-07T20:32:42.6061315Z self, 2025-05-07T20:32:42.6061585Z T: int, 2025-05-07T20:32:42.6061866Z D: int, 2025-05-07T20:32:42.6062165Z scale_ub: Optional[float], 2025-05-07T20:32:42.6062562Z contiguous: bool, 2025-05-07T20:32:42.6062894Z compiled: bool, 2025-05-07T20:32:42.6063213Z ) -> None: 2025-05-07T20:32:42.6063514Z torch.manual_seed(2025) 2025-05-07T20:32:42.6063870Z 2025-05-07T20:32:42.6064403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6064907Z 2025-05-07T20:32:42.6065189Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6065610Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6066060Z x = x_sign * x_clamp 2025-05-07T20:32:42.6066414Z x0 = x[:, :D] 2025-05-07T20:32:42.6066723Z x1 = x[:, D:] 2025-05-07T20:32:42.6067013Z 2025-05-07T20:32:42.6067278Z if contiguous: 2025-05-07T20:32:42.6067594Z x0 = x0.contiguous() 
2025-05-07T20:32:42.6067953Z x1 = x1.contiguous() 2025-05-07T20:32:42.6068288Z 2025-05-07T20:32:42.6068553Z if scale_ub is not None: 2025-05-07T20:32:42.6068919Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6069374Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6069908Z ) 2025-05-07T20:32:42.6070165Z else: 2025-05-07T20:32:42.6070465Z scale_ub_tensor = None 2025-05-07T20:32:42.6070830Z 2025-05-07T20:32:42.6071157Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6071594Z op = silu_mul_quant 2025-05-07T20:32:42.6071950Z if compiled: 2025-05-07T20:32:42.6072290Z op = torch.compile(op) 2025-05-07T20:32:42.6072717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6073218Z 2025-05-07T20:32:42.6073493Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.6073899Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.6074329Z 2025-05-07T20:32:42.6074671Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6075152Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.6075610Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.6076050Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.6076562Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.6076993Z 2025-05-07T20:32:42.6077262Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.6077540Z 2025-05-07T20:32:42.6077683Z moe/activation_test.py:126: 2025-05-07T20:32:42.6078084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6078552Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.6079009Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.6080142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.6081219Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.6081965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6083229Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6084191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.6085195Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6086228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.6087263Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6088310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.6089264Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.6090159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.6090930Z fn() 2025-05-07T20:32:42.6091926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.6092788Z self.fn.run( 2025-05-07T20:32:42.6093466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6094243Z kernel = self.compile( 2025-05-07T20:32:42.6095025Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6095971Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6096533Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6096886Z 2025-05-07T20:32:42.6097180Z self = 2025-05-07T20:32:42.6098792Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6100907Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd80e7040>} 2025-05-07T20:32:42.6102968Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6104550Z context = 2025-05-07T20:32:42.6104967Z 2025-05-07T20:32:42.6105197Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6105935Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6106687Z module_map=module_map) 2025-05-07T20:32:42.6107188Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6107686Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.6108071Z E ^ 2025-05-07T20:32:42.6108725Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6109395Z 2025-05-07T20:32:42.6113332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6114115Z 2025-05-07T20:32:42.6114260Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6114843Z self=, 2025-05-07T20:32:42.6115409Z T=2048, 2025-05-07T20:32:42.6115673Z D=5120, 2025-05-07T20:32:42.6115945Z scale_ub=1200.0, 2025-05-07T20:32:42.6116256Z contiguous=True, 2025-05-07T20:32:42.6116579Z compiled=False, 2025-05-07T20:32:42.6116877Z ) 2025-05-07T20:32:42.6117330Z self = 2025-05-07T20:32:42.6118073Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.6118478Z 2025-05-07T20:32:42.6118580Z @given( 2025-05-07T20:32:42.6118872Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6119277Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6119683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6120122Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6120553Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6120936Z ) 2025-05-07T20:32:42.6121399Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6121987Z def test_silu_mul_quant( 2025-05-07T20:32:42.6122303Z self, 2025-05-07T20:32:42.6122555Z T: int, 2025-05-07T20:32:42.6122816Z D: int, 2025-05-07T20:32:42.6123093Z scale_ub: Optional[float], 2025-05-07T20:32:42.6123451Z contiguous: bool, 2025-05-07T20:32:42.6123764Z compiled: bool, 2025-05-07T20:32:42.6124172Z ) -> None: 2025-05-07T20:32:42.6124464Z torch.manual_seed(2025) 2025-05-07T20:32:42.6124791Z 2025-05-07T20:32:42.6125136Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6125593Z 2025-05-07T20:32:42.6125852Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6126235Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6126644Z x = x_sign * x_clamp 2025-05-07T20:32:42.6126970Z x0 = x[:, :D] 
2025-05-07T20:32:42.6127249Z x1 = x[:, D:] 2025-05-07T20:32:42.6127522Z 2025-05-07T20:32:42.6127762Z if contiguous: 2025-05-07T20:32:42.6128059Z x0 = x0.contiguous() 2025-05-07T20:32:42.6128401Z x1 = x1.contiguous() 2025-05-07T20:32:42.6128718Z 2025-05-07T20:32:42.6128971Z if scale_ub is not None: 2025-05-07T20:32:42.6129334Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6129780Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6130205Z ) 2025-05-07T20:32:42.6130454Z else: 2025-05-07T20:32:42.6130729Z scale_ub_tensor = None 2025-05-07T20:32:42.6131063Z 2025-05-07T20:32:42.6131359Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6131795Z op = silu_mul_quant 2025-05-07T20:32:42.6132151Z if compiled: 2025-05-07T20:32:42.6132475Z op = torch.compile(op) 2025-05-07T20:32:42.6132952Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6133344Z 2025-05-07T20:32:42.6133593Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6133830Z 2025-05-07T20:32:42.6133960Z moe/activation_test.py:117: 2025-05-07T20:32:42.6134371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6134896Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6135278Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6136263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6137251Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6137973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6138915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6139833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6140572Z kernel = self.compile( 2025-05-07T20:32:42.6141303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6142194Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6142736Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6143054Z 2025-05-07T20:32:42.6143327Z self = 2025-05-07T20:32:42.6144827Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6146820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd85aa4c0>} 2025-05-07T20:32:42.6148776Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6150385Z context = 2025-05-07T20:32:42.6150802Z 2025-05-07T20:32:42.6151036Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6151880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6152540Z module_map=module_map) 2025-05-07T20:32:42.6153042Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6153518Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6153898Z E ^ 2025-05-07T20:32:42.6154578Z E ValueError("type fp8e4nv not supported in this architecture. 
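This fourth failure is unrelated to memory: Triton refuses to lower the kernel because the GPU cannot execute fp8e4nv (FP8 E4M3). The job runs on linux.g5.4xlarge.nvidia.gpu, i.e. an NVIDIA A10G at compute capability 8.6, and the message offers only fp8e4b15/fp8e5, consistent with E4M3 requiring a newer architecture. A guard along these lines would skip instead of fail on such hardware (the (8, 9) cutoff is an assumption inferred from the error, not something this log or repo states):

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Assumption: Triton's fp8e4nv path needs compute capability >= 8.9,
        # which excludes the A10G (8.6) this job runs on.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8_e4m3(), "fp8e4nv unsupported on this GPU")
    class Fp8KernelTests(unittest.TestCase):  # illustrative name, not from the repo
        ...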
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6155262Z 2025-05-07T20:32:42.6155883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6156620Z 2025-05-07T20:32:42.6156761Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6157353Z self=, 2025-05-07T20:32:42.6157908Z T=2048, 2025-05-07T20:32:42.6158166Z D=5120, 2025-05-07T20:32:42.6158436Z scale_ub=1200.0, 2025-05-07T20:32:42.6158739Z contiguous=True, 2025-05-07T20:32:42.6159041Z compiled=True, 2025-05-07T20:32:42.6159298Z ) 2025-05-07T20:32:42.6159751Z self = 2025-05-07T20:32:42.6160450Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.6160859Z 2025-05-07T20:32:42.6160985Z @given( 2025-05-07T20:32:42.6161356Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6161780Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6162213Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6162660Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6163108Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6183118Z ) 2025-05-07T20:32:42.6183641Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6184268Z def test_silu_mul_quant( 2025-05-07T20:32:42.6184608Z self, 2025-05-07T20:32:42.6184872Z T: int, 2025-05-07T20:32:42.6185165Z D: int, 2025-05-07T20:32:42.6185460Z scale_ub: Optional[float], 2025-05-07T20:32:42.6185868Z contiguous: bool, 2025-05-07T20:32:42.6186213Z compiled: bool, 2025-05-07T20:32:42.6186531Z ) -> None: 2025-05-07T20:32:42.6186847Z torch.manual_seed(2025) 2025-05-07T20:32:42.6187212Z 2025-05-07T20:32:42.6187611Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6188094Z 2025-05-07T20:32:42.6188310Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6188616Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6188937Z x = x_sign * x_clamp 2025-05-07T20:32:42.6189198Z x0 = x[:, :D] 2025-05-07T20:32:42.6189425Z x1 = x[:, D:] 2025-05-07T20:32:42.6189637Z 2025-05-07T20:32:42.6189933Z if contiguous: 2025-05-07T20:32:42.6190175Z x0 = x0.contiguous() 2025-05-07T20:32:42.6190433Z x1 = x1.contiguous() 2025-05-07T20:32:42.6190680Z 2025-05-07T20:32:42.6190872Z if scale_ub is not None: 2025-05-07T20:32:42.6191148Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6191488Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6191807Z ) 2025-05-07T20:32:42.6191994Z else: 2025-05-07T20:32:42.6192212Z scale_ub_tensor = None 2025-05-07T20:32:42.6192468Z 2025-05-07T20:32:42.6192690Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6193023Z op = silu_mul_quant 2025-05-07T20:32:42.6193278Z if compiled: 2025-05-07T20:32:42.6193529Z op = torch.compile(op) 2025-05-07T20:32:42.6193830Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6194115Z 2025-05-07T20:32:42.6194309Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.6194945Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.6195247Z 2025-05-07T20:32:42.6195488Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6195833Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.6196130Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.6196453Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.6196821Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.6197134Z 2025-05-07T20:32:42.6197334Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.6197540Z 2025-05-07T20:32:42.6197640Z moe/activation_test.py:126: 2025-05-07T20:32:42.6197939Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6198284Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.6198616Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.6199467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.6200274Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.6200848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6201579Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6202314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.6203166Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6203971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.6204843Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6205627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.6206301Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.6206938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.6207490Z fn() 2025-05-07T20:32:42.6208017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.6208638Z self.fn.run( 2025-05-07T20:32:42.6209122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6209687Z kernel = self.compile( 2025-05-07T20:32:42.6210249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6210946Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6211357Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6211599Z 2025-05-07T20:32:42.6211815Z self = 2025-05-07T20:32:42.6212974Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6214497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f3dd6cde0d0>} 2025-05-07T20:32:42.6215966Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6217128Z context = 2025-05-07T20:32:42.6217429Z 2025-05-07T20:32:42.6217685Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6218233Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6218726Z module_map=module_map) 2025-05-07T20:32:42.6219097Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6219455Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.6219733Z E ^ 2025-05-07T20:32:42.6220219Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6220706Z 2025-05-07T20:32:42.6221152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6221714Z 2025-05-07T20:32:42.6221817Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6222244Z self=, 2025-05-07T20:32:42.6222675Z T=16384, 2025-05-07T20:32:42.6222864Z D=7168, 2025-05-07T20:32:42.6223055Z scale_ub=1200.0, 2025-05-07T20:32:42.6223278Z contiguous=False, 2025-05-07T20:32:42.6223497Z compiled=False, 2025-05-07T20:32:42.6223704Z ) 2025-05-07T20:32:42.6224029Z self = 2025-05-07T20:32:42.6224548Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6224896Z 2025-05-07T20:32:42.6224974Z @given( 2025-05-07T20:32:42.6225196Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6225514Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6225820Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6226201Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6226535Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6226816Z ) 2025-05-07T20:32:42.6227179Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6227642Z def test_silu_mul_quant( 2025-05-07T20:32:42.6227882Z self, 2025-05-07T20:32:42.6228080Z T: int, 2025-05-07T20:32:42.6228274Z D: int, 2025-05-07T20:32:42.6228486Z scale_ub: Optional[float], 2025-05-07T20:32:42.6230324Z contiguous: bool, 2025-05-07T20:32:42.6230568Z compiled: bool, 2025-05-07T20:32:42.6230789Z ) -> None: 2025-05-07T20:32:42.6231005Z torch.manual_seed(2025) 2025-05-07T20:32:42.6231255Z 2025-05-07T20:32:42.6231523Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6231875Z 2025-05-07T20:32:42.6232065Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6232358Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6232680Z x = x_sign * x_clamp 2025-05-07T20:32:42.6232921Z x0 = x[:, :D] 2025-05-07T20:32:42.6233141Z x1 = x[:, D:] 2025-05-07T20:32:42.6233348Z 2025-05-07T20:32:42.6233531Z if contiguous: 2025-05-07T20:32:42.6233766Z x0 = x0.contiguous() 2025-05-07T20:32:42.6234023Z x1 = x1.contiguous() 2025-05-07T20:32:42.6234270Z 2025-05-07T20:32:42.6234466Z if scale_ub is not None: 2025-05-07T20:32:42.6234737Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6235079Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6235404Z ) 2025-05-07T20:32:42.6235593Z else: 2025-05-07T20:32:42.6235812Z scale_ub_tensor = None 2025-05-07T20:32:42.6236071Z 2025-05-07T20:32:42.6236296Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6236620Z op = silu_mul_quant 2025-05-07T20:32:42.6236881Z if compiled: 
2025-05-07T20:32:42.6237135Z op = torch.compile(op) 2025-05-07T20:32:42.6237434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6237718Z 2025-05-07T20:32:42.6238006Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6238180Z 2025-05-07T20:32:42.6238278Z moe/activation_test.py:117: 2025-05-07T20:32:42.6238581Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6238927Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6239208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6239948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6240695Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6241268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6241997Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6242706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6243281Z kernel = self.compile( 2025-05-07T20:32:42.6243846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6244544Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6244954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6245195Z 2025-05-07T20:32:42.6245462Z self = 2025-05-07T20:32:42.6246656Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6248226Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd6c12040>} 2025-05-07T20:32:42.6249697Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6250801Z context = 2025-05-07T20:32:42.6251105Z 2025-05-07T20:32:42.6251279Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6251824Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6252315Z module_map=module_map) 2025-05-07T20:32:42.6252690Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6253046Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6253318Z E ^ 2025-05-07T20:32:42.6253807Z E ValueError("type fp8e4nv not supported in this architecture. 
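For orientation, the reference path's triton_quantize_fp8_row is a row-wise FP8 quantizer; the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], so it must return one dequantization scale per row. A pure-PyTorch sketch of that contract (the function name and the scale_ub clamping detail are assumptions for illustration, not FBGEMM's implementation):

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max-abs scaling into float8 e4m3 (finite max 448.0).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # assumed clamp semantics
        row_max = row_max.clamp(min=1e-12)              # guard against all-zero rows
        y_scale = row_max / fp8_max                     # one dequant scale per row
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

    # Round trip: y_fp8.to(torch.float32) * y_scale[:, None] approximates y.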
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6254294Z 2025-05-07T20:32:42.6254757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6255317Z 2025-05-07T20:32:42.6255423Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6255854Z self=, 2025-05-07T20:32:42.6256271Z T=1, 2025-05-07T20:32:42.6256449Z D=7168, 2025-05-07T20:32:42.6256654Z scale_ub=None, 2025-05-07T20:32:42.6256911Z contiguous=True, 2025-05-07T20:32:42.6257132Z compiled=True, 2025-05-07T20:32:42.6257335Z ) 2025-05-07T20:32:42.6257659Z self = 2025-05-07T20:32:42.6258165Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.6258439Z 2025-05-07T20:32:42.6258515Z @given( 2025-05-07T20:32:42.6258748Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6259071Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6259465Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6259808Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6260146Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6260435Z ) 2025-05-07T20:32:42.6260795Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6261259Z def test_silu_mul_quant( 2025-05-07T20:32:42.6261512Z self, 2025-05-07T20:32:42.6261700Z T: int, 2025-05-07T20:32:42.6261900Z D: int, 2025-05-07T20:32:42.6262117Z scale_ub: Optional[float], 2025-05-07T20:32:42.6262392Z contiguous: bool, 2025-05-07T20:32:42.6262634Z compiled: bool, 2025-05-07T20:32:42.6262865Z ) -> None: 2025-05-07T20:32:42.6263086Z torch.manual_seed(2025) 2025-05-07T20:32:42.6263335Z 2025-05-07T20:32:42.6263608Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6263966Z 2025-05-07T20:32:42.6264162Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6264451Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6264774Z x = x_sign * x_clamp 2025-05-07T20:32:42.6265017Z x0 = x[:, :D] 2025-05-07T20:32:42.6265231Z x1 = x[:, D:] 2025-05-07T20:32:42.6265443Z 2025-05-07T20:32:42.6265628Z if contiguous: 2025-05-07T20:32:42.6265867Z x0 = x0.contiguous() 2025-05-07T20:32:42.6266177Z x1 = x1.contiguous() 2025-05-07T20:32:42.6266424Z 2025-05-07T20:32:42.6266624Z if scale_ub is not None: 2025-05-07T20:32:42.6266895Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6267238Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6267625Z ) 2025-05-07T20:32:42.6267815Z else: 2025-05-07T20:32:42.6268022Z scale_ub_tensor = None 2025-05-07T20:32:42.6268284Z 2025-05-07T20:32:42.6268517Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6268840Z op = silu_mul_quant 2025-05-07T20:32:42.6269092Z if compiled: 2025-05-07T20:32:42.6269334Z op = torch.compile(op) 2025-05-07T20:32:42.6269638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6270027Z 2025-05-07T20:32:42.6270215Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.6270506Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.6270807Z 2025-05-07T20:32:42.6271044Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6271382Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.6271681Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.6272005Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.6272372Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.6272695Z 2025-05-07T20:32:42.6272898Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:42.6273104Z 2025-05-07T20:32:42.6273205Z moe/activation_test.py:126: 2025-05-07T20:32:42.6273508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6273858Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.6274193Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.6275033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.6275852Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.6276432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6277154Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6277891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.6278754Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6279566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.6280364Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6281150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.6281841Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.6282485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.6283303Z fn() 2025-05-07T20:32:42.6283846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.6284466Z self.fn.run( 2025-05-07T20:32:42.6284958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6285526Z kernel = self.compile( 2025-05-07T20:32:42.6286097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6286844Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6287250Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6287584Z 2025-05-07T20:32:42.6287795Z self = 2025-05-07T20:32:42.6288963Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6290537Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f3dd6c12dc0>} 2025-05-07T20:32:42.6292013Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6293129Z context = 2025-05-07T20:32:42.6293445Z 2025-05-07T20:32:42.6293615Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6294163Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6294653Z module_map=module_map) 2025-05-07T20:32:42.6295028Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6295397Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.6295666Z E ^ 2025-05-07T20:32:42.6296154Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6296696Z 2025-05-07T20:32:42.6297144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6297701Z 2025-05-07T20:32:42.6297810Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6298231Z self=, 2025-05-07T20:32:42.6298656Z T=4096, 2025-05-07T20:32:42.6298840Z D=5120, 2025-05-07T20:32:42.6299032Z scale_ub=None, 2025-05-07T20:32:42.6299239Z contiguous=False, 2025-05-07T20:32:42.6299469Z compiled=False, 2025-05-07T20:32:42.6299675Z ) 2025-05-07T20:32:42.6299998Z self = 2025-05-07T20:32:42.6300520Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6300812Z 2025-05-07T20:32:42.6300895Z @given( 2025-05-07T20:32:42.6301235Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6301559Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6301876Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6302206Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6302545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6302837Z ) 2025-05-07T20:32:42.6303196Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6303659Z def test_silu_mul_quant( 2025-05-07T20:32:42.6303907Z self, 2025-05-07T20:32:42.6304099Z T: int, 2025-05-07T20:32:42.6304286Z D: int, 2025-05-07T20:32:42.6304493Z scale_ub: Optional[float], 2025-05-07T20:32:42.6304761Z contiguous: bool, 2025-05-07T20:32:42.6304998Z compiled: bool, 2025-05-07T20:32:42.6305222Z ) -> None: 2025-05-07T20:32:42.6305437Z torch.manual_seed(2025) 2025-05-07T20:32:42.6305677Z 2025-05-07T20:32:42.6305954Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6306313Z 2025-05-07T20:32:42.6306506Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6306803Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6307121Z x = x_sign * x_clamp 2025-05-07T20:32:42.6307361Z x0 = x[:, :D] 2025-05-07T20:32:42.6307578Z x1 = x[:, D:] 2025-05-07T20:32:42.6307834Z 2025-05-07T20:32:42.6308017Z if contiguous: 2025-05-07T20:32:42.6308245Z x0 = x0.contiguous() 2025-05-07T20:32:42.6308504Z x1 = x1.contiguous() 2025-05-07T20:32:42.6308749Z 2025-05-07T20:32:42.6308933Z if scale_ub is not None: 2025-05-07T20:32:42.6309210Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6309596Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6310030Z ) 2025-05-07T20:32:42.6310224Z else: 2025-05-07T20:32:42.6310436Z scale_ub_tensor = None 2025-05-07T20:32:42.6310685Z 2025-05-07T20:32:42.6310915Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6311236Z op = silu_mul_quant 2025-05-07T20:32:42.6311483Z if compiled: 
2025-05-07T20:32:42.6311730Z op = torch.compile(op) 2025-05-07T20:32:42.6312030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6312306Z 2025-05-07T20:32:42.6312499Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6312672Z 2025-05-07T20:32:42.6312770Z moe/activation_test.py:117: 2025-05-07T20:32:42.6313074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6313413Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6313697Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6314440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6315179Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6315745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6316475Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6317180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6317742Z kernel = self.compile( 2025-05-07T20:32:42.6318311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6319007Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6319415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6319665Z 2025-05-07T20:32:42.6319876Z self = 2025-05-07T20:32:42.6321134Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6322653Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd6c120d0>} 2025-05-07T20:32:42.6324138Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6325250Z context = 2025-05-07T20:32:42.6325564Z 2025-05-07T20:32:42.6325735Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6326285Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6326784Z module_map=module_map) 2025-05-07T20:32:42.6327156Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6327520Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6327783Z E ^ 2025-05-07T20:32:42.6328269Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6328763Z 2025-05-07T20:32:42.6329262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6329829Z 2025-05-07T20:32:42.6329935Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6330365Z self=, 2025-05-07T20:32:42.6330828Z T=4096, 2025-05-07T20:32:42.6331014Z D=7168, 2025-05-07T20:32:42.6331207Z scale_ub=None, 2025-05-07T20:32:42.6331416Z contiguous=False, 2025-05-07T20:32:42.6331638Z compiled=False, 2025-05-07T20:32:42.6331845Z ) 2025-05-07T20:32:42.6332159Z self = 2025-05-07T20:32:42.6332674Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6332968Z 2025-05-07T20:32:42.6333042Z @given( 2025-05-07T20:32:42.6333271Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6333581Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6333896Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6334234Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6334567Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6334861Z ) 2025-05-07T20:32:42.6335221Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6335678Z def test_silu_mul_quant( 2025-05-07T20:32:42.6335918Z self, 2025-05-07T20:32:42.6336109Z T: int, 2025-05-07T20:32:42.6336298Z D: int, 2025-05-07T20:32:42.6336540Z scale_ub: Optional[float], 2025-05-07T20:32:42.6336852Z contiguous: bool, 2025-05-07T20:32:42.6337092Z compiled: bool, 2025-05-07T20:32:42.6337308Z ) -> None: 2025-05-07T20:32:42.6337521Z torch.manual_seed(2025) 2025-05-07T20:32:42.6337764Z 2025-05-07T20:32:42.6338030Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6338389Z 2025-05-07T20:32:42.6338581Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6338866Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6339183Z x = x_sign * x_clamp 2025-05-07T20:32:42.6339426Z x0 = x[:, :D] 2025-05-07T20:32:42.6339633Z x1 = x[:, D:] 2025-05-07T20:32:42.6346251Z 2025-05-07T20:32:42.6346473Z if contiguous: 2025-05-07T20:32:42.6346712Z x0 = x0.contiguous() 2025-05-07T20:32:42.6346979Z x1 = x1.contiguous() 2025-05-07T20:32:42.6347223Z 2025-05-07T20:32:42.6347526Z if scale_ub is not None: 2025-05-07T20:32:42.6347809Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6348163Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6348482Z ) 2025-05-07T20:32:42.6348686Z else: 2025-05-07T20:32:42.6348905Z scale_ub_tensor = None 2025-05-07T20:32:42.6349168Z 2025-05-07T20:32:42.6349397Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6349833Z op = silu_mul_quant 2025-05-07T20:32:42.6350099Z if compiled: 2025-05-07T20:32:42.6350347Z op = torch.compile(op) 2025-05-07T20:32:42.6350657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6350942Z 2025-05-07T20:32:42.6351135Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6351307Z 2025-05-07T20:32:42.6351409Z moe/activation_test.py:117: 2025-05-07T20:32:42.6351714Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6352057Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6352347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6353090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6353840Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6354404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6355216Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6355923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6356487Z kernel = self.compile( 2025-05-07T20:32:42.6357110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6357815Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6358231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6358470Z 2025-05-07T20:32:42.6358686Z self = 2025-05-07T20:32:42.6359854Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6361358Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd65f33a0>} 2025-05-07T20:32:42.6362823Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6363938Z context = 2025-05-07T20:32:42.6364245Z 2025-05-07T20:32:42.6364415Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6364961Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6365453Z module_map=module_map) 2025-05-07T20:32:42.6365823Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6366185Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6366451Z E ^ 2025-05-07T20:32:42.6366933Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <...>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <...>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f3dd6791700>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
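For context on what the failing reference path computes: ref_fn builds SiLU(x0) * x1 in fp32 and hands it to triton_quantize_fp8_row, which returns an FP8 tensor plus one inverse scale per row (the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]). A minimal PyTorch sketch of that rowwise scheme, assuming the usual e4m3 max magnitude of 448.0; this is an illustration, not FBGEMM's actual kernel:

    import torch
    from typing import Optional, Tuple

    E4M3_MAX = 448.0  # assumed max magnitude representable in float8_e4m3fn

    def rowwise_quantize_fp8_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, driven by the row's max magnitude.
        row_max = y.abs().amax(dim=1, keepdim=True).float()
        if scale_ub is not None:
            # Clamp outlier rows to the provided upper bound.
            row_max = torch.minimum(row_max, scale_ub)
        scale = E4M3_MAX / row_max.clamp(min=1e-12)  # avoid divide-by-zero rows
        y_fp8 = (y.float() * scale).to(torch.float8_e4m3fn)
        # Return the inverse scale so y ~= y_fp8.float() * y_scale[:, None].
        return y_fp8, scale.reciprocal().squeeze(1)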
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False

    [... test body identical to the previous example ...]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <...>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f3dd6143310>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
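Every failure above is the same underlying issue: both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant cast to fp8e4nv (Triton's name for the e4m3 FP8 format), whose NVIDIA lowering requires compute capability 8.9 or newer (Ada/Hopper). The A10G on this linux.g5.4xlarge runner reports capability 8.6, where the backend only offers fp8e4b15 and fp8e5, hence the ValueError at kernel-compile time. A sketch of a capability guard such tests could use to skip on unsupported GPUs; the helper name is illustrative, not an existing FBGEMM utility:

    import unittest
    import torch

    def cuda_supports_fp8_e4m3() -> bool:
        # Triton lowers fp8e4nv only on compute capability >= (8, 9);
        # the A10G on g5 instances reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Illustrative usage on the failing test:
    # @unittest.skipUnless(cuda_supports_fp8_e4m3(), "FP8 e4m3 needs SM 8.9+ (Ada/Hopper)")
    # def test_silu_mul_quant(self, ...): ...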
[... the next six generated examples fail identically; only the Hypothesis parameters and the failing call site change:

Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=True, compiled=False -> fn() at moe/activation_test.py:117, _fbgemm_silu_mul_quant
Trying example: T=1,     D=5120, scale_ub=None,   contiguous=True, compiled=True  -> ref_fn() at moe/activation_test.py:126, _kernel_quantize_fp8_row
Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=True, compiled=True  -> ref_fn() at moe/activation_test.py:126, _kernel_quantize_fp8_row
Trying example: T=128,   D=5120, scale_ub=None,   contiguous=True, compiled=True  -> ref_fn() at moe/activation_test.py:126, _kernel_quantize_fp8_row
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=True, compiled=True  -> ref_fn() at moe/activation_test.py:126, _kernel_quantize_fp8_row
Trying example: T=16384, D=5120, scale_ub=None,   contiguous=True, compiled=True  -> ref_fn() at moe/activation_test.py:126, _kernel_quantize_fp8_row

each ending with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]
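The failure is independent of FBGEMM's kernels: any Triton cast to fp8e4nv should reproduce the same CompilationError on this GPU. A minimal sketch, assuming a recent Triton that exposes tl.float8e4nv and a PyTorch with the float8_e4m3fn dtype:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_e4m3(X, Y, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(X + offs, mask=mask)
        # On SM 8.6 this cast fails at compile time with
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(Y + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    cast_to_e4m3[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)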
torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
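Both failures have the same root cause: fp8e4nv is Triton's name for the torch.float8_e4m3fn dtype, and
Triton's NVIDIA backend only generates it for GPUs of compute capability 8.9 or newer (Ada/Hopper); the
A10G (SM 8.6) found on g5 runners predates that, so every FP8 kernel in this test fails to compile. A
minimal guard along the following lines (a sketch with a hypothetical helper name, not code from
activation_test.py) would skip these examples on unsupported GPUs instead of erroring out:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton emits fp8e4nv (torch.float8_e4m3fn) only for compute
        # capability >= 8.9 (Ada / Hopper); the A10G above is SM 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...

The same predicate could instead feed hypothesis's assume() inside the test body if only the FP8-dependent
examples should be skipped.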
Hypothesis went on to try eleven more examples, and every one failed with the same
triton.compiler.errors.CompilationError raised from
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100:
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").
Examples failing at `y_fp8, y_scale = fn()` (moe/activation_test.py:117) hit the error while compiling
_fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80); examples that reached the
reference path `y_fp8_ref, y_scale_ref = ref_fn()` (moe/activation_test.py:126) hit it while compiling
_kernel_quantize_fp8_row via triton_quantize_fp8_row
(fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370); a plain-PyTorch sketch of that row-wise
quantization follows the list below. With the identical per-example source listings and tracebacks elided,
the examples tried were:

Trying example: T=1,     D=5120, scale_ub=None,   contiguous=True,  compiled=False  -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=True   -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=False  -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=False  -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=128,   D=5120, scale_ub=1200.0, contiguous=True,  compiled=False  -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=1,     D=7168, scale_ub=1200.0, contiguous=True,  compiled=True   -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=True   -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=True   -> fails in ref_fn (_kernel_quantize_fp8_row)
Trying example: T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=True   -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=False  -> fails in fn (_fbgemm_silu_mul_quant)

The final example and the start of its traceback, as captured, follow the sketch below.
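For reference, the row-wise FP8 quantization that ref_fn delegates to triton_quantize_fp8_row can be
sketched in plain PyTorch. This is an illustrative reconstruction inferred from the test's dequantization
step (`y = y_fp8.to(torch.float32) * y_scale[:, None]`); the exact scale_ub semantics and the choice of
float8_e4m3fn are assumptions, not the kernel source:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row maximum magnitude, optionally clamped from above by
        # scale_ub (assumed semantics of the scale_ub argument).
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Guard against all-zero rows before dividing.
        y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        # Clamp into the representable range, then cast to FP8.
        y_fp8 = torch.clamp(
            y.to(torch.float32) / y_scale, -FP8_MAX, FP8_MAX
        ).to(torch.float8_e4m3fn)
        # Dequantize as the test does: y ~= y_fp8.float() * y_scale[:, None]
        return y_fp8, y_scale.squeeze(-1)

Because the FP8 cast here is an ordinary PyTorch dtype conversion rather than Triton codegen, a sketch
like this also runs on the SM 8.6 runner where the Triton kernel cannot compile.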
2025-05-07T20:32:42.6713340Z op = torch.compile(op) 2025-05-07T20:32:42.6713453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6713529Z 2025-05-07T20:32:42.6713728Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6713737Z 2025-05-07T20:32:42.6713840Z moe/activation_test.py:117: 2025-05-07T20:32:42.6713974Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6714082Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6714182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6714576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6714675Z return fn(*args, **kwargs) 2025-05-07T20:32:42.6715217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6715317Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6715702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6715935Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6716306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6716400Z kernel = self.compile( 2025-05-07T20:32:42.6716810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6716993Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6717169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6717174Z 2025-05-07T20:32:42.6717390Z self = 2025-05-07T20:32:42.6718242Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6718836Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4636f70>} 2025-05-07T20:32:42.6719651Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6719848Z context = 2025-05-07T20:32:42.6719855Z 2025-05-07T20:32:42.6720026Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6720302Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6720412Z module_map=module_map) 2025-05-07T20:32:42.6720588Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6720689Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6720774Z E ^ 2025-05-07T20:32:42.6721163Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6721167Z 2025-05-07T20:32:42.6721614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6721619Z 2025-05-07T20:32:42.6721727Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6721965Z self=, 2025-05-07T20:32:42.6722052Z T=1, 2025-05-07T20:32:42.6722132Z D=5120, 2025-05-07T20:32:42.6722217Z scale_ub=1200.0, 2025-05-07T20:32:42.6722306Z contiguous=False, 2025-05-07T20:32:42.6722390Z compiled=False, 2025-05-07T20:32:42.6722469Z ) 2025-05-07T20:32:42.6722704Z self = 2025-05-07T20:32:42.6722879Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6722883Z 2025-05-07T20:32:42.6723046Z @given( 2025-05-07T20:32:42.6723168Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6723268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6723385Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6723501Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6723613Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6723689Z ) 2025-05-07T20:32:42.6723947Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6724038Z def test_silu_mul_quant( 2025-05-07T20:32:42.6724121Z self, 2025-05-07T20:32:42.6724199Z T: int, 2025-05-07T20:32:42.6724278Z D: int, 2025-05-07T20:32:42.6724383Z scale_ub: Optional[float], 2025-05-07T20:32:42.6724476Z contiguous: bool, 2025-05-07T20:32:42.6724563Z compiled: bool, 2025-05-07T20:32:42.6724644Z ) -> None: 2025-05-07T20:32:42.6724745Z torch.manual_seed(2025) 2025-05-07T20:32:42.6724827Z 2025-05-07T20:32:42.6725001Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6725079Z 2025-05-07T20:32:42.6725176Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6725303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6725396Z x = x_sign * x_clamp 2025-05-07T20:32:42.6725482Z x0 = x[:, :D] 2025-05-07T20:32:42.6725606Z x1 = x[:, D:] 2025-05-07T20:32:42.6725679Z 2025-05-07T20:32:42.6725764Z if contiguous: 2025-05-07T20:32:42.6725854Z x0 = x0.contiguous() 2025-05-07T20:32:42.6725943Z x1 = x1.contiguous() 2025-05-07T20:32:42.6726019Z 2025-05-07T20:32:42.6726108Z if scale_ub is not None: 2025-05-07T20:32:42.6726254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6726391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6726469Z ) 2025-05-07T20:32:42.6726553Z else: 2025-05-07T20:32:42.6726644Z scale_ub_tensor = None 2025-05-07T20:32:42.6726716Z 2025-05-07T20:32:42.6726848Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6726938Z op = silu_mul_quant 2025-05-07T20:32:42.6727021Z if compiled: 2025-05-07T20:32:42.6727127Z op = torch.compile(op) 2025-05-07T20:32:42.6727234Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6727310Z 2025-05-07T20:32:42.6727404Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6727408Z 2025-05-07T20:32:42.6727509Z moe/activation_test.py:117: 2025-05-07T20:32:42.6727647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6727749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6727854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6728405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6728506Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6728891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6729135Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6729500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6729601Z kernel = self.compile( 2025-05-07T20:32:42.6730011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6730190Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6730337Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6730342Z 2025-05-07T20:32:42.6730555Z self = 2025-05-07T20:32:42.6731493Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6732042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd46d1700>} 2025-05-07T20:32:42.6732856Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6733055Z context = 2025-05-07T20:32:42.6733061Z 2025-05-07T20:32:42.6733231Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6733513Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6733620Z module_map=module_map) 2025-05-07T20:32:42.6733782Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6733883Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6733962Z E ^ 2025-05-07T20:32:42.6734347Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6734396Z 2025-05-07T20:32:42.6734840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6734844Z 2025-05-07T20:32:42.6734945Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6735181Z self=, 2025-05-07T20:32:42.6735302Z T=16384, 2025-05-07T20:32:42.6735380Z D=5120, 2025-05-07T20:32:42.6735471Z scale_ub=1200.0, 2025-05-07T20:32:42.6735560Z contiguous=False, 2025-05-07T20:32:42.6735642Z compiled=True, 2025-05-07T20:32:42.6735716Z ) 2025-05-07T20:32:42.6735943Z self = 2025-05-07T20:32:42.6736131Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.6736136Z 2025-05-07T20:32:42.6736213Z @given( 2025-05-07T20:32:42.6736332Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6736435Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6736554Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6736672Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6736790Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6736867Z ) 2025-05-07T20:32:42.6737134Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6737231Z def test_silu_mul_quant( 2025-05-07T20:32:42.6737310Z self, 2025-05-07T20:32:42.6737395Z T: int, 2025-05-07T20:32:42.6737473Z D: int, 2025-05-07T20:32:42.6737575Z scale_ub: Optional[float], 2025-05-07T20:32:42.6737671Z contiguous: bool, 2025-05-07T20:32:42.6737758Z compiled: bool, 2025-05-07T20:32:42.6737837Z ) -> None: 2025-05-07T20:32:42.6737936Z torch.manual_seed(2025) 2025-05-07T20:32:42.6738012Z 2025-05-07T20:32:42.6738186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6738269Z 2025-05-07T20:32:42.6738362Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6738488Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6738581Z x = x_sign * x_clamp 2025-05-07T20:32:42.6738663Z x0 = x[:, :D] 2025-05-07T20:32:42.6738752Z x1 = x[:, D:] 2025-05-07T20:32:42.6738827Z 2025-05-07T20:32:42.6738911Z if contiguous: 2025-05-07T20:32:42.6739007Z x0 = x0.contiguous() 2025-05-07T20:32:42.6739179Z x1 = x1.contiguous() 2025-05-07T20:32:42.6739256Z 2025-05-07T20:32:42.6739350Z if scale_ub is not None: 2025-05-07T20:32:42.6739455Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6739593Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6739674Z ) 2025-05-07T20:32:42.6739754Z else: 2025-05-07T20:32:42.6739847Z scale_ub_tensor = None 2025-05-07T20:32:42.6739924Z 2025-05-07T20:32:42.6740053Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6740145Z op = silu_mul_quant 2025-05-07T20:32:42.6740228Z if compiled: 2025-05-07T20:32:42.6740326Z op = torch.compile(op) 2025-05-07T20:32:42.6740436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6740513Z 2025-05-07T20:32:42.6740602Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6740606Z 2025-05-07T20:32:42.6740705Z moe/activation_test.py:117: 2025-05-07T20:32:42.6740842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6740942Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6741042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6741434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6741527Z return fn(*args, **kwargs) 
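Every failure in this run bottoms out in the same Triton ValueError: fp8e4nv is Triton's name for the e4m3 FP8 format (torch.float8_e4m3fn), which Triton's NVIDIA backend only accepts on GPUs of compute capability sm_89 or newer, while this job's linux.g5.4xlarge runner carries an A10G reporting sm_86. A minimal guard that would skip the test on such hardware, sketched with unittest (the helper and class names here are hypothetical, not FBGEMM's actual test scaffolding):

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs an NVIDIA GPU with compute
    # capability >= (8, 9); the A10G on a g5 runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class SiluMulQuantFP8Test(unittest.TestCase):  # hypothetical name
    ...
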
2025-05-07T20:32:42.6742131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6742229Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6742616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6742887Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6743250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6743350Z kernel = self.compile( 2025-05-07T20:32:42.6743759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6743941Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6744070Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6744077Z 2025-05-07T20:32:42.6744288Z self = 2025-05-07T20:32:42.6745139Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6745691Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4addd30>} 2025-05-07T20:32:42.6746511Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6746709Z context = 2025-05-07T20:32:42.6746714Z 2025-05-07T20:32:42.6746884Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6747163Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6747267Z module_map=module_map) 2025-05-07T20:32:42.6747430Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6747528Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6747606Z E ^ 2025-05-07T20:32:42.6747990Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6748097Z 2025-05-07T20:32:42.6748545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6748549Z 2025-05-07T20:32:42.6748654Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6748886Z self=, 2025-05-07T20:32:42.6748967Z T=2048, 2025-05-07T20:32:42.6749056Z D=7168, 2025-05-07T20:32:42.6749143Z scale_ub=1200.0, 2025-05-07T20:32:42.6749233Z contiguous=False, 2025-05-07T20:32:42.6749323Z compiled=True, 2025-05-07T20:32:42.6749398Z ) 2025-05-07T20:32:42.6749628Z self = 2025-05-07T20:32:42.6749912Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.6749919Z 2025-05-07T20:32:42.6750000Z @given( 2025-05-07T20:32:42.6750127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6750234Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6750353Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6750481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6750596Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6750675Z ) 2025-05-07T20:32:42.6750939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6751080Z def test_silu_mul_quant( 2025-05-07T20:32:42.6751162Z self, 2025-05-07T20:32:42.6751240Z T: int, 2025-05-07T20:32:42.6751318Z D: int, 2025-05-07T20:32:42.6751423Z scale_ub: Optional[float], 2025-05-07T20:32:42.6751516Z contiguous: bool, 2025-05-07T20:32:42.6751606Z compiled: bool, 2025-05-07T20:32:42.6751733Z ) -> None: 2025-05-07T20:32:42.6752032Z torch.manual_seed(2025) 2025-05-07T20:32:42.6752108Z 2025-05-07T20:32:42.6752292Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6752366Z 2025-05-07T20:32:42.6752458Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6752590Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6752682Z x = x_sign * x_clamp 2025-05-07T20:32:42.6752762Z x0 = x[:, :D] 2025-05-07T20:32:42.6752848Z x1 = x[:, D:] 2025-05-07T20:32:42.6752922Z 2025-05-07T20:32:42.6753011Z if contiguous: 2025-05-07T20:32:42.6753106Z x0 = x0.contiguous() 2025-05-07T20:32:42.6753195Z x1 = x1.contiguous() 2025-05-07T20:32:42.6753272Z 2025-05-07T20:32:42.6753367Z if scale_ub is not None: 2025-05-07T20:32:42.6753476Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6753618Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6753703Z ) 2025-05-07T20:32:42.6753782Z else: 2025-05-07T20:32:42.6753878Z scale_ub_tensor = None 2025-05-07T20:32:42.6753950Z 2025-05-07T20:32:42.6754088Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6754183Z op = silu_mul_quant 2025-05-07T20:32:42.6754269Z if compiled: 2025-05-07T20:32:42.6754374Z op = torch.compile(op) 2025-05-07T20:32:42.6754481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6754554Z 2025-05-07T20:32:42.6754649Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6754656Z 2025-05-07T20:32:42.6754753Z moe/activation_test.py:117: 2025-05-07T20:32:42.6754887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6754994Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6755097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6755492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6755592Z return fn(*args, **kwargs) 
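Note how the compiled=True examples route through torch/_dynamo/eval_frame.py before reaching the kernel: torch.compile returns a wrapper immediately and defers all tracing and backend compilation to the first call, which is why the Triton error surfaces inside fn() rather than at the op = torch.compile(op) line. A small illustration of that lazy behavior:

import torch


def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.silu(x0) * x1


fn = torch.compile(silu_mul)  # returns a wrapper at once; nothing compiled yet
x = torch.randn(4, 8)
out = fn(x, x)  # first call traces and compiles, so backend errors appear here
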
2025-05-07T20:32:42.6756218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6756322Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6756707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6756944Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6757320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6757417Z kernel = self.compile( 2025-05-07T20:32:42.6761734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6761942Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6762085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6762090Z 2025-05-07T20:32:42.6762315Z self = 2025-05-07T20:32:42.6763179Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6763741Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4059b80>} 2025-05-07T20:32:42.6764639Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6764887Z context = 2025-05-07T20:32:42.6764892Z 2025-05-07T20:32:42.6765066Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6765354Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6765469Z module_map=module_map) 2025-05-07T20:32:42.6765641Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6765750Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6765831Z E ^ 2025-05-07T20:32:42.6766220Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6766227Z 2025-05-07T20:32:42.6766687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6766691Z 2025-05-07T20:32:42.6766799Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6767041Z self=, 2025-05-07T20:32:42.6767123Z T=1, 2025-05-07T20:32:42.6767206Z D=5120, 2025-05-07T20:32:42.6767300Z scale_ub=None, 2025-05-07T20:32:42.6767395Z contiguous=False, 2025-05-07T20:32:42.6767483Z compiled=False, 2025-05-07T20:32:42.6767565Z ) 2025-05-07T20:32:42.6767797Z self = 2025-05-07T20:32:42.6767976Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6767983Z 2025-05-07T20:32:42.6768066Z @given( 2025-05-07T20:32:42.6768194Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6768307Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6768427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6768550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6768677Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6768759Z ) 2025-05-07T20:32:42.6769023Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6769124Z def test_silu_mul_quant( 2025-05-07T20:32:42.6769290Z self, 2025-05-07T20:32:42.6769373Z T: int, 2025-05-07T20:32:42.6769457Z D: int, 2025-05-07T20:32:42.6769560Z scale_ub: Optional[float], 2025-05-07T20:32:42.6769656Z contiguous: bool, 2025-05-07T20:32:42.6769747Z compiled: bool, 2025-05-07T20:32:42.6769831Z ) -> None: 2025-05-07T20:32:42.6769934Z torch.manual_seed(2025) 2025-05-07T20:32:42.6770015Z 2025-05-07T20:32:42.6770193Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6770275Z 2025-05-07T20:32:42.6770372Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6770503Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6770601Z x = x_sign * x_clamp 2025-05-07T20:32:42.6770692Z x0 = x[:, :D] 2025-05-07T20:32:42.6770775Z x1 = x[:, D:] 2025-05-07T20:32:42.6770855Z 2025-05-07T20:32:42.6770949Z if contiguous: 2025-05-07T20:32:42.6771045Z x0 = x0.contiguous() 2025-05-07T20:32:42.6771147Z x1 = x1.contiguous() 2025-05-07T20:32:42.6771228Z 2025-05-07T20:32:42.6771327Z if scale_ub is not None: 2025-05-07T20:32:42.6771440Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6771587Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6771668Z ) 2025-05-07T20:32:42.6771748Z else: 2025-05-07T20:32:42.6771889Z scale_ub_tensor = None 2025-05-07T20:32:42.6771972Z 2025-05-07T20:32:42.6772108Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6772203Z op = silu_mul_quant 2025-05-07T20:32:42.6772296Z if compiled: 2025-05-07T20:32:42.6772401Z op = torch.compile(op) 2025-05-07T20:32:42.6772553Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6772630Z 2025-05-07T20:32:42.6772725Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6772730Z 2025-05-07T20:32:42.6772842Z moe/activation_test.py:117: 2025-05-07T20:32:42.6772982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6773090Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6773197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6773744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6773852Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6774247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6774489Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6774865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6774965Z kernel = self.compile( 2025-05-07T20:32:42.6775385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6775576Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6775713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6775718Z 2025-05-07T20:32:42.6775935Z self = 2025-05-07T20:32:42.6776797Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6777355Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd45925e0>} 2025-05-07T20:32:42.6778263Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6778466Z context = 2025-05-07T20:32:42.6778470Z 2025-05-07T20:32:42.6778650Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6778933Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6779049Z module_map=module_map) 2025-05-07T20:32:42.6779222Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6779325Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6779406Z E ^ 2025-05-07T20:32:42.6779800Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6779807Z 2025-05-07T20:32:42.6780266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6780270Z 2025-05-07T20:32:42.6780384Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6780627Z self=, 2025-05-07T20:32:42.6780711Z T=4096, 2025-05-07T20:32:42.6780795Z D=7168, 2025-05-07T20:32:42.6780884Z scale_ub=1200.0, 2025-05-07T20:32:42.6780975Z contiguous=False, 2025-05-07T20:32:42.6781108Z compiled=False, 2025-05-07T20:32:42.6781183Z ) 2025-05-07T20:32:42.6781417Z self = 2025-05-07T20:32:42.6781605Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6781609Z 2025-05-07T20:32:42.6781691Z @given( 2025-05-07T20:32:42.6781888Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6781991Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6782114Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6782245Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6782364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6782448Z ) 2025-05-07T20:32:42.6782715Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6783020Z def test_silu_mul_quant( 2025-05-07T20:32:42.6783140Z self, 2025-05-07T20:32:42.6783238Z T: int, 2025-05-07T20:32:42.6783323Z D: int, 2025-05-07T20:32:42.6783429Z scale_ub: Optional[float], 2025-05-07T20:32:42.6783521Z contiguous: bool, 2025-05-07T20:32:42.6783609Z compiled: bool, 2025-05-07T20:32:42.6783690Z ) -> None: 2025-05-07T20:32:42.6783787Z torch.manual_seed(2025) 2025-05-07T20:32:42.6783860Z 2025-05-07T20:32:42.6784043Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6784121Z 2025-05-07T20:32:42.6784216Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6784354Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6784446Z x = x_sign * x_clamp 2025-05-07T20:32:42.6784533Z x0 = x[:, :D] 2025-05-07T20:32:42.6784615Z x1 = x[:, D:] 2025-05-07T20:32:42.6784689Z 2025-05-07T20:32:42.6784779Z if contiguous: 2025-05-07T20:32:42.6784874Z x0 = x0.contiguous() 2025-05-07T20:32:42.6784967Z x1 = x1.contiguous() 2025-05-07T20:32:42.6785049Z 2025-05-07T20:32:42.6785143Z if scale_ub is not None: 2025-05-07T20:32:42.6785250Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6785391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6785469Z ) 2025-05-07T20:32:42.6785546Z else: 2025-05-07T20:32:42.6785646Z scale_ub_tensor = None 2025-05-07T20:32:42.6785725Z 2025-05-07T20:32:42.6785862Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6785955Z op = silu_mul_quant 2025-05-07T20:32:42.6786191Z if compiled: 2025-05-07T20:32:42.6786298Z op = torch.compile(op) 2025-05-07T20:32:42.6786407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6786481Z 2025-05-07T20:32:42.6786578Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6786582Z 2025-05-07T20:32:42.6786685Z moe/activation_test.py:117: 2025-05-07T20:32:42.6786830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6786948Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6787057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6787679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6787783Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6788216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6788479Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6788888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6788984Z kernel = self.compile( 2025-05-07T20:32:42.6789452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6789648Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6789965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6789969Z 2025-05-07T20:32:42.6790185Z self = 2025-05-07T20:32:42.6791035Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6791660Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd45928b0>} 2025-05-07T20:32:42.6792476Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6792685Z context = 2025-05-07T20:32:42.6792689Z 2025-05-07T20:32:42.6792862Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6793144Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6793258Z module_map=module_map) 2025-05-07T20:32:42.6793423Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6793527Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6793609Z E ^ 2025-05-07T20:32:42.6793996Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6794001Z 2025-05-07T20:32:42.6794454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6794458Z 2025-05-07T20:32:42.6794565Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6794805Z self=, 2025-05-07T20:32:42.6794884Z T=16384, 2025-05-07T20:32:42.6794964Z D=7168, 2025-05-07T20:32:42.6795052Z scale_ub=None, 2025-05-07T20:32:42.6795141Z contiguous=True, 2025-05-07T20:32:42.6795228Z compiled=True, 2025-05-07T20:32:42.6795308Z ) 2025-05-07T20:32:42.6795536Z self = 2025-05-07T20:32:42.6795718Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.6795806Z 2025-05-07T20:32:42.6795888Z @given( 2025-05-07T20:32:42.6796011Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6796111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6796224Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6796341Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6796456Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6796533Z ) 2025-05-07T20:32:42.6796790Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6796889Z def test_silu_mul_quant( 2025-05-07T20:32:42.6796968Z self, 2025-05-07T20:32:42.6797048Z T: int, 2025-05-07T20:32:42.6797137Z D: int, 2025-05-07T20:32:42.6797238Z scale_ub: Optional[float], 2025-05-07T20:32:42.6797330Z contiguous: bool, 2025-05-07T20:32:42.6797419Z compiled: bool, 2025-05-07T20:32:42.6797505Z ) -> None: 2025-05-07T20:32:42.6797612Z torch.manual_seed(2025) 2025-05-07T20:32:42.6797690Z 2025-05-07T20:32:42.6797865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6797942Z 2025-05-07T20:32:42.6798042Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6798169Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6798263Z x = x_sign * x_clamp 2025-05-07T20:32:42.6798391Z x0 = x[:, :D] 2025-05-07T20:32:42.6798474Z x1 = x[:, D:] 2025-05-07T20:32:42.6798547Z 2025-05-07T20:32:42.6798634Z if contiguous: 2025-05-07T20:32:42.6798728Z x0 = x0.contiguous() 2025-05-07T20:32:42.6798824Z x1 = x1.contiguous() 2025-05-07T20:32:42.6798901Z 2025-05-07T20:32:42.6799037Z if scale_ub is not None: 2025-05-07T20:32:42.6799141Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6799276Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6799360Z ) 2025-05-07T20:32:42.6799438Z else: 2025-05-07T20:32:42.6799534Z scale_ub_tensor = None 2025-05-07T20:32:42.6799614Z 2025-05-07T20:32:42.6799748Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6799839Z op = silu_mul_quant 2025-05-07T20:32:42.6799931Z if compiled: 2025-05-07T20:32:42.6800033Z op = torch.compile(op) 2025-05-07T20:32:42.6800144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6800225Z 2025-05-07T20:32:42.6800318Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6800323Z 2025-05-07T20:32:42.6800427Z moe/activation_test.py:117: 2025-05-07T20:32:42.6800564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6800670Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6800777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6801175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6801272Z return fn(*args, **kwargs) 
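For orientation, silu_mul_quant fuses a SiLU-gated elementwise multiply with quantization to FP8, returning the quantized tensor together with its scale, with scale_ub capping the dynamic range used for scaling. A rough eager-mode reference of that math (the row-wise scaling scheme below is an assumption for illustration, not the FBGEMM kernel's documented contract):

import torch

FP8_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn


def silu_mul_quant_ref(x0, x1, scale_ub=None):
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub)  # cap the dynamic range
    scale = amax / FP8_MAX
    y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale
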
2025-05-07T20:32:42.6801812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6801912Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6802301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6802540Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6802905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6803003Z kernel = self.compile( 2025-05-07T20:32:42.6803419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6803599Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6803817Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6803822Z 2025-05-07T20:32:42.6804034Z self = 2025-05-07T20:32:42.6804883Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6805433Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd42abc10>} 2025-05-07T20:32:42.6806247Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6806452Z context = 2025-05-07T20:32:42.6806456Z 2025-05-07T20:32:42.6806626Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6806903Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6807011Z module_map=module_map) 2025-05-07T20:32:42.6807180Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6807321Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6807398Z E ^ 2025-05-07T20:32:42.6807784Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6807789Z 2025-05-07T20:32:42.6808238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6808284Z 2025-05-07T20:32:42.6808392Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6808629Z self=, 2025-05-07T20:32:42.6808710Z T=4096, 2025-05-07T20:32:42.6808793Z D=5120, 2025-05-07T20:32:42.6808877Z scale_ub=None, 2025-05-07T20:32:42.6808965Z contiguous=False, 2025-05-07T20:32:42.6809053Z compiled=True, 2025-05-07T20:32:42.6809131Z ) 2025-05-07T20:32:42.6809363Z self = 2025-05-07T20:32:42.6809550Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6809554Z 2025-05-07T20:32:42.6809633Z @given( 2025-05-07T20:32:42.6809756Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6809858Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6809979Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6810101Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6810217Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6810299Z ) 2025-05-07T20:32:42.6810566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6810661Z def test_silu_mul_quant( 2025-05-07T20:32:42.6810742Z self, 2025-05-07T20:32:42.6810827Z T: int, 2025-05-07T20:32:42.6810911Z D: int, 2025-05-07T20:32:42.6811014Z scale_ub: Optional[float], 2025-05-07T20:32:42.6811108Z contiguous: bool, 2025-05-07T20:32:42.6811201Z compiled: bool, 2025-05-07T20:32:42.6811287Z ) -> None: 2025-05-07T20:32:42.6811383Z torch.manual_seed(2025) 2025-05-07T20:32:42.6811462Z 2025-05-07T20:32:42.6811644Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6811724Z 2025-05-07T20:32:42.6811822Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6811955Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6812046Z x = x_sign * x_clamp 2025-05-07T20:32:42.6812127Z x0 = x[:, :D] 2025-05-07T20:32:42.6812297Z x1 = x[:, D:] 2025-05-07T20:32:42.6812375Z 2025-05-07T20:32:42.6812458Z if contiguous: 2025-05-07T20:32:42.6812551Z x0 = x0.contiguous() 2025-05-07T20:32:42.6812644Z x1 = x1.contiguous() 2025-05-07T20:32:42.6812722Z 2025-05-07T20:32:42.6812821Z if scale_ub is not None: 2025-05-07T20:32:42.6812934Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6813076Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6813151Z ) 2025-05-07T20:32:42.6813229Z else: 2025-05-07T20:32:42.6813329Z scale_ub_tensor = None 2025-05-07T20:32:42.6813406Z 2025-05-07T20:32:42.6813541Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6813641Z op = silu_mul_quant 2025-05-07T20:32:42.6813728Z if compiled: 2025-05-07T20:32:42.6813829Z op = torch.compile(op) 2025-05-07T20:32:42.6813946Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6814022Z 2025-05-07T20:32:42.6814117Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6814128Z 2025-05-07T20:32:42.6814226Z moe/activation_test.py:117: 2025-05-07T20:32:42.6814363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6814467Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6814569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6815036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6815136Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6815673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6815809Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6816194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6816432Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6816798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6816890Z kernel = self.compile( 2025-05-07T20:32:42.6817299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6817482Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6817614Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6817619Z 2025-05-07T20:32:42.6817835Z self = 2025-05-07T20:32:42.6818687Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6819236Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd427d820>} 2025-05-07T20:32:42.6820055Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6820259Z context = 2025-05-07T20:32:42.6820264Z 2025-05-07T20:32:42.6820441Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6820719Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6820831Z module_map=module_map) 2025-05-07T20:32:42.6820999Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6821177Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6821260Z E ^ 2025-05-07T20:32:42.6821640Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6821645Z 2025-05-07T20:32:42.6822090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6822097Z 2025-05-07T20:32:42.6822208Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6822439Z self=, 2025-05-07T20:32:42.6822526Z T=4096, 2025-05-07T20:32:42.6822609Z D=5120, 2025-05-07T20:32:42.6822696Z scale_ub=1200.0, 2025-05-07T20:32:42.6822787Z contiguous=False, 2025-05-07T20:32:42.6822876Z compiled=False, 2025-05-07T20:32:42.6822953Z ) 2025-05-07T20:32:42.6823184Z self = 2025-05-07T20:32:42.6823373Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6823378Z 2025-05-07T20:32:42.6823459Z @given( 2025-05-07T20:32:42.6823587Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6823688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6823811Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6823931Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6824091Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6824171Z ) 2025-05-07T20:32:42.6824429Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6824526Z def test_silu_mul_quant( 2025-05-07T20:32:42.6824618Z self, 2025-05-07T20:32:42.6824735Z T: int, 2025-05-07T20:32:42.6824814Z D: int, 2025-05-07T20:32:42.6824921Z scale_ub: Optional[float], 2025-05-07T20:32:42.6825013Z contiguous: bool, 2025-05-07T20:32:42.6825105Z compiled: bool, 2025-05-07T20:32:42.6825193Z ) -> None: 2025-05-07T20:32:42.6825289Z torch.manual_seed(2025) 2025-05-07T20:32:42.6825364Z 2025-05-07T20:32:42.6825552Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6825632Z 2025-05-07T20:32:42.6825728Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6825859Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6825954Z x = x_sign * x_clamp 2025-05-07T20:32:42.6826039Z x0 = x[:, :D] 2025-05-07T20:32:42.6826122Z x1 = x[:, D:] 2025-05-07T20:32:42.6826197Z 2025-05-07T20:32:42.6826281Z if contiguous: 2025-05-07T20:32:42.6826381Z x0 = x0.contiguous() 2025-05-07T20:32:42.6826473Z x1 = x1.contiguous() 2025-05-07T20:32:42.6826557Z 2025-05-07T20:32:42.6826654Z if scale_ub is not None: 2025-05-07T20:32:42.6826763Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6826907Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6826986Z ) 2025-05-07T20:32:42.6827065Z else: 2025-05-07T20:32:42.6827165Z scale_ub_tensor = None 2025-05-07T20:32:42.6827243Z 2025-05-07T20:32:42.6827374Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6827468Z op = silu_mul_quant 2025-05-07T20:32:42.6827554Z if compiled: 2025-05-07T20:32:42.6827658Z op = torch.compile(op) 2025-05-07T20:32:42.6827769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6827846Z 2025-05-07T20:32:42.6827940Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6827946Z 2025-05-07T20:32:42.6828047Z moe/activation_test.py:117: 2025-05-07T20:32:42.6828181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6828290Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6828393Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6829016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6829117Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6829499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6829802Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6830173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6830267Z kernel = self.compile( 2025-05-07T20:32:42.6830680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6830860Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6830990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6830994Z 2025-05-07T20:32:42.6831216Z self = 2025-05-07T20:32:42.6832063Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6832614Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4119280>} 2025-05-07T20:32:42.6833472Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6833716Z context = 2025-05-07T20:32:42.6833721Z 2025-05-07T20:32:42.6833896Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6834175Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6834285Z module_map=module_map) 2025-05-07T20:32:42.6834447Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6834544Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6834626Z E ^ 2025-05-07T20:32:42.6835016Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6835021Z 2025-05-07T20:32:42.6835472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6835476Z 2025-05-07T20:32:42.6835583Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6835820Z self=, 2025-05-07T20:32:42.6835906Z T=4096, 2025-05-07T20:32:42.6835989Z D=5120, 2025-05-07T20:32:42.6836076Z scale_ub=1200.0, 2025-05-07T20:32:42.6836167Z contiguous=False, 2025-05-07T20:32:42.6836254Z compiled=True, 2025-05-07T20:32:42.6836335Z ) 2025-05-07T20:32:42.6836569Z self = 2025-05-07T20:32:42.6836754Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.6836760Z 2025-05-07T20:32:42.6836843Z @given( 2025-05-07T20:32:42.6836964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6837065Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6837187Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6837309Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6837427Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6837507Z ) 2025-05-07T20:32:42.6837767Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6837951Z def test_silu_mul_quant( 2025-05-07T20:32:42.6838027Z self, 2025-05-07T20:32:42.6838107Z T: int, 2025-05-07T20:32:42.6838188Z D: int, 2025-05-07T20:32:42.6838290Z scale_ub: Optional[float], 2025-05-07T20:32:42.6838380Z contiguous: bool, 2025-05-07T20:32:42.6838473Z compiled: bool, 2025-05-07T20:32:42.6838555Z ) -> None: 2025-05-07T20:32:42.6838654Z torch.manual_seed(2025) 2025-05-07T20:32:42.6838734Z 2025-05-07T20:32:42.6838911Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6838988Z 2025-05-07T20:32:42.6839084Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6839212Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6839311Z x = x_sign * x_clamp 2025-05-07T20:32:42.6839392Z x0 = x[:, :D] 2025-05-07T20:32:42.6839473Z x1 = x[:, D:] 2025-05-07T20:32:42.6839553Z 2025-05-07T20:32:42.6839641Z if contiguous: 2025-05-07T20:32:42.6839744Z x0 = x0.contiguous() 2025-05-07T20:32:42.6839839Z x1 = x1.contiguous() 2025-05-07T20:32:42.6839915Z 2025-05-07T20:32:42.6840006Z if scale_ub is not None: 2025-05-07T20:32:42.6840116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6840256Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6840332Z ) 2025-05-07T20:32:42.6840458Z else: 2025-05-07T20:32:42.6840552Z scale_ub_tensor = None 2025-05-07T20:32:42.6840622Z 2025-05-07T20:32:42.6840757Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6840846Z op = silu_mul_quant 2025-05-07T20:32:42.6840932Z if compiled: 2025-05-07T20:32:42.6841073Z op = torch.compile(op) 2025-05-07T20:32:42.6841177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6841254Z 2025-05-07T20:32:42.6841345Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6841355Z 2025-05-07T20:32:42.6841453Z moe/activation_test.py:117: 2025-05-07T20:32:42.6841588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6841691Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6841793Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6842193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6842292Z return fn(*args, **kwargs) 
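Since Hypothesis only logs the examples it happened to try, a failing combination such as T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True is worth pinning with the @example decorator once the architecture issue is resolved, so it re-runs deterministically on every invocation instead of depending on random search. A self-contained sketch of the pattern:

from hypothesis import example, given, settings
from hypothesis import strategies as st


@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
@example(T=4096)  # pin a previously failing shape so it runs every time
@settings(deadline=None)
def test_shape(T: int) -> None:
    assert T > 0
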
2025-05-07T20:32:42.6842833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6842931Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6843312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6843552Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6843921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6844021Z kernel = self.compile( 2025-05-07T20:32:42.6844433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6844615Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6844754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6844758Z 2025-05-07T20:32:42.6844973Z self = 2025-05-07T20:32:42.6845823Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6846480Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4119700>} 2025-05-07T20:32:42.6847292Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6847498Z context = 2025-05-07T20:32:42.6847505Z 2025-05-07T20:32:42.6847680Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6847964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6848073Z module_map=module_map) 2025-05-07T20:32:42.6848243Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6848349Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6848430Z E ^ 2025-05-07T20:32:42.6848824Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6848833Z 2025-05-07T20:32:42.6849282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6849286Z 2025-05-07T20:32:42.6849391Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6849629Z self=, 2025-05-07T20:32:42.6849752Z T=2048, 2025-05-07T20:32:42.6849830Z D=7168, 2025-05-07T20:32:42.6849919Z scale_ub=1200.0, 2025-05-07T20:32:42.6850007Z contiguous=False, 2025-05-07T20:32:42.6850096Z compiled=False, 2025-05-07T20:32:42.6850176Z ) 2025-05-07T20:32:42.6850406Z self = 2025-05-07T20:32:42.6850633Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6850638Z 2025-05-07T20:32:42.6850720Z @given( 2025-05-07T20:32:42.6850850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6850957Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6851074Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6851193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6851312Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6851392Z ) 2025-05-07T20:32:42.6851657Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6851756Z def test_silu_mul_quant( 2025-05-07T20:32:42.6851836Z self, 2025-05-07T20:32:42.6851919Z T: int, 2025-05-07T20:32:42.6851998Z D: int, 2025-05-07T20:32:42.6852101Z scale_ub: Optional[float], 2025-05-07T20:32:42.6852199Z contiguous: bool, 2025-05-07T20:32:42.6852287Z compiled: bool, 2025-05-07T20:32:42.6852365Z ) -> None: 2025-05-07T20:32:42.6852467Z torch.manual_seed(2025) 2025-05-07T20:32:42.6852550Z 2025-05-07T20:32:42.6852726Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6852808Z 2025-05-07T20:32:42.6852902Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6853029Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6853124Z x = x_sign * x_clamp 2025-05-07T20:32:42.6853206Z x0 = x[:, :D] 2025-05-07T20:32:42.6853295Z x1 = x[:, D:] 2025-05-07T20:32:42.6853370Z 2025-05-07T20:32:42.6853455Z if contiguous: 2025-05-07T20:32:42.6853550Z x0 = x0.contiguous() 2025-05-07T20:32:42.6853642Z x1 = x1.contiguous() 2025-05-07T20:32:42.6853719Z 2025-05-07T20:32:42.6853818Z if scale_ub is not None: 2025-05-07T20:32:42.6853931Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6854072Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6854151Z ) 2025-05-07T20:32:42.6854229Z else: 2025-05-07T20:32:42.6854410Z scale_ub_tensor = None 2025-05-07T20:32:42.6854491Z 2025-05-07T20:32:42.6854623Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6854711Z op = silu_mul_quant 2025-05-07T20:32:42.6854799Z if compiled: 2025-05-07T20:32:42.6854897Z op = torch.compile(op) 2025-05-07T20:32:42.6855010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6855087Z 2025-05-07T20:32:42.6855181Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6855185Z 2025-05-07T20:32:42.6855289Z moe/activation_test.py:117: 2025-05-07T20:32:42.6855425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6855529Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6855634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6856177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6856281Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6856663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6856894Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6857259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6857394Z kernel = self.compile( 2025-05-07T20:32:42.6857804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6857989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6858124Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6858169Z 2025-05-07T20:32:42.6858385Z self = 2025-05-07T20:32:42.6859240Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6859791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd409a790>} 2025-05-07T20:32:42.6860606Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6860802Z context = 2025-05-07T20:32:42.6860809Z 2025-05-07T20:32:42.6860982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6861261Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6861374Z module_map=module_map) 2025-05-07T20:32:42.6861538Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6861637Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6861718Z E ^ 2025-05-07T20:32:42.6862097Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6862105Z 2025-05-07T20:32:42.6862548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6862553Z 2025-05-07T20:32:42.6862656Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6862887Z self=, 2025-05-07T20:32:42.6862978Z T=1, 2025-05-07T20:32:42.6863060Z D=7168, 2025-05-07T20:32:42.6863143Z scale_ub=None, 2025-05-07T20:32:42.6863233Z contiguous=True, 2025-05-07T20:32:42.6863401Z compiled=False, 2025-05-07T20:32:42.6863481Z ) 2025-05-07T20:32:42.6863710Z self = 2025-05-07T20:32:42.6863878Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6863883Z 2025-05-07T20:32:42.6863962Z @given( 2025-05-07T20:32:42.6864088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6864193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6864316Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6864434Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6864550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6864630Z ) 2025-05-07T20:32:42.6864888Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6864988Z def test_silu_mul_quant( 2025-05-07T20:32:42.6865074Z self, 2025-05-07T20:32:42.6865155Z T: int, 2025-05-07T20:32:42.6865244Z D: int, 2025-05-07T20:32:42.6865350Z scale_ub: Optional[float], 2025-05-07T20:32:42.6865441Z contiguous: bool, 2025-05-07T20:32:42.6865529Z compiled: bool, 2025-05-07T20:32:42.6865614Z ) -> None: 2025-05-07T20:32:42.6865711Z torch.manual_seed(2025) 2025-05-07T20:32:42.6865789Z 2025-05-07T20:32:42.6865967Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6866082Z 2025-05-07T20:32:42.6866177Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6866299Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6866393Z x = x_sign * x_clamp 2025-05-07T20:32:42.6866477Z x0 = x[:, :D] 2025-05-07T20:32:42.6866559Z x1 = x[:, D:] 2025-05-07T20:32:42.6866694Z 2025-05-07T20:32:42.6866796Z if contiguous: 2025-05-07T20:32:42.6866904Z x0 = x0.contiguous() 2025-05-07T20:32:42.6866995Z x1 = x1.contiguous() 2025-05-07T20:32:42.6867082Z 2025-05-07T20:32:42.6867176Z if scale_ub is not None: 2025-05-07T20:32:42.6867288Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6867425Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6867504Z ) 2025-05-07T20:32:42.6867587Z else: 2025-05-07T20:32:42.6867682Z scale_ub_tensor = None 2025-05-07T20:32:42.6867761Z 2025-05-07T20:32:42.6867898Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6867993Z op = silu_mul_quant 2025-05-07T20:32:42.6868079Z if compiled: 2025-05-07T20:32:42.6868185Z op = torch.compile(op) 2025-05-07T20:32:42.6868292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6868370Z 2025-05-07T20:32:42.6868465Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6868470Z 2025-05-07T20:32:42.6868568Z moe/activation_test.py:117: 2025-05-07T20:32:42.6868711Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6868814Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6868916Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6869465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6869566Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6870071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6870315Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6870682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6870783Z kernel = self.compile( 2025-05-07T20:32:42.6871196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6871461Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6871598Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6871602Z 2025-05-07T20:32:42.6871815Z self = 2025-05-07T20:32:42.6872664Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6873215Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd40d40d0>} 2025-05-07T20:32:42.6874030Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6874233Z context = 2025-05-07T20:32:42.6874237Z 2025-05-07T20:32:42.6874406Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6874685Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6874795Z module_map=module_map) 2025-05-07T20:32:42.6875000Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6875104Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6875184Z E ^ 2025-05-07T20:32:42.6875569Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6875617Z 2025-05-07T20:32:42.6876065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6876070Z 2025-05-07T20:32:42.6876175Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6876411Z self=, 2025-05-07T20:32:42.6876493Z T=16384, 2025-05-07T20:32:42.6876573Z D=7168, 2025-05-07T20:32:42.6876663Z scale_ub=1200.0, 2025-05-07T20:32:42.6876750Z contiguous=False, 2025-05-07T20:32:42.6876837Z compiled=True, 2025-05-07T20:32:42.6876916Z ) 2025-05-07T20:32:42.6877146Z self = 2025-05-07T20:32:42.6877337Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.6877341Z 2025-05-07T20:32:42.6877423Z @given( 2025-05-07T20:32:42.6877544Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6877649Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6877771Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6877889Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6878012Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6878089Z ) 2025-05-07T20:32:42.6878353Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6878449Z def test_silu_mul_quant( 2025-05-07T20:32:42.6878530Z self, 2025-05-07T20:32:42.6878614Z T: int, 2025-05-07T20:32:42.6878695Z D: int, 2025-05-07T20:32:42.6878797Z scale_ub: Optional[float], 2025-05-07T20:32:42.6878894Z contiguous: bool, 2025-05-07T20:32:42.6878981Z compiled: bool, 2025-05-07T20:32:42.6879062Z ) -> None: 2025-05-07T20:32:42.6879160Z torch.manual_seed(2025) 2025-05-07T20:32:42.6879238Z 2025-05-07T20:32:42.6879415Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6879497Z 2025-05-07T20:32:42.6879592Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6879721Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6879923Z x = x_sign * x_clamp 2025-05-07T20:32:42.6880006Z x0 = x[:, :D] 2025-05-07T20:32:42.6880085Z x1 = x[:, D:] 2025-05-07T20:32:42.6880157Z 2025-05-07T20:32:42.6880243Z if contiguous: 2025-05-07T20:32:42.6880340Z x0 = x0.contiguous() 2025-05-07T20:32:42.6880431Z x1 = x1.contiguous() 2025-05-07T20:32:42.6880508Z 2025-05-07T20:32:42.6880608Z if scale_ub is not None: 2025-05-07T20:32:42.6880723Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6880864Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6880948Z ) 2025-05-07T20:32:42.6881028Z else: 2025-05-07T20:32:42.6881123Z scale_ub_tensor = None 2025-05-07T20:32:42.6881202Z 2025-05-07T20:32:42.6881336Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6881431Z op = silu_mul_quant 2025-05-07T20:32:42.6881521Z if compiled: 2025-05-07T20:32:42.6881628Z op = torch.compile(op) 2025-05-07T20:32:42.6881741Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6881814Z 2025-05-07T20:32:42.6881908Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6881913Z 2025-05-07T20:32:42.6882016Z moe/activation_test.py:117: 2025-05-07T20:32:42.6882149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6882252Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6882403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6882988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6883132Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6883688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6883884Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6884274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6884508Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6884874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6884974Z kernel = self.compile( 2025-05-07T20:32:42.6885384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6885572Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6885705Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6885710Z 2025-05-07T20:32:42.6885924Z self = 2025-05-07T20:32:42.6886779Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6887332Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd40d4d30>} 2025-05-07T20:32:42.6888149Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6888349Z context = 2025-05-07T20:32:42.6888353Z 2025-05-07T20:32:42.6888526Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6888804Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6888911Z module_map=module_map) 2025-05-07T20:32:42.6889195Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6889299Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6889384Z E ^ 2025-05-07T20:32:42.6889773Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6889778Z 2025-05-07T20:32:42.6890229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6890237Z 2025-05-07T20:32:42.6890344Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6894456Z self=, 2025-05-07T20:32:42.6894556Z T=1, 2025-05-07T20:32:42.6894634Z D=7168, 2025-05-07T20:32:42.6894716Z scale_ub=None, 2025-05-07T20:32:42.6894813Z contiguous=False, 2025-05-07T20:32:42.6894897Z compiled=False, 2025-05-07T20:32:42.6894970Z ) 2025-05-07T20:32:42.6895212Z self = 2025-05-07T20:32:42.6895390Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6895395Z 2025-05-07T20:32:42.6895472Z @given( 2025-05-07T20:32:42.6895596Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6895698Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6895815Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6896028Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6896140Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6896219Z ) 2025-05-07T20:32:42.6896485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6896587Z def test_silu_mul_quant( 2025-05-07T20:32:42.6896708Z self, 2025-05-07T20:32:42.6896786Z T: int, 2025-05-07T20:32:42.6896867Z D: int, 2025-05-07T20:32:42.6896970Z scale_ub: Optional[float], 2025-05-07T20:32:42.6897066Z contiguous: bool, 2025-05-07T20:32:42.6897155Z compiled: bool, 2025-05-07T20:32:42.6897237Z ) -> None: 2025-05-07T20:32:42.6897336Z torch.manual_seed(2025) 2025-05-07T20:32:42.6897418Z 2025-05-07T20:32:42.6897593Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6897667Z 2025-05-07T20:32:42.6897764Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6897892Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6897986Z x = x_sign * x_clamp 2025-05-07T20:32:42.6898073Z x0 = x[:, :D] 2025-05-07T20:32:42.6898150Z x1 = x[:, D:] 2025-05-07T20:32:42.6898226Z 2025-05-07T20:32:42.6898309Z if contiguous: 2025-05-07T20:32:42.6898402Z x0 = x0.contiguous() 2025-05-07T20:32:42.6898500Z x1 = x1.contiguous() 2025-05-07T20:32:42.6898570Z 2025-05-07T20:32:42.6898657Z if scale_ub is not None: 2025-05-07T20:32:42.6898767Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6898906Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6898982Z ) 2025-05-07T20:32:42.6899063Z else: 2025-05-07T20:32:42.6899156Z scale_ub_tensor = None 2025-05-07T20:32:42.6899229Z 2025-05-07T20:32:42.6899363Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6899458Z op = silu_mul_quant 2025-05-07T20:32:42.6899551Z if compiled: 2025-05-07T20:32:42.6899654Z op = torch.compile(op) 2025-05-07T20:32:42.6899761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6899841Z 2025-05-07T20:32:42.6899938Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6899943Z 2025-05-07T20:32:42.6900041Z moe/activation_test.py:117: 2025-05-07T20:32:42.6900181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6900290Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6900475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6901031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6901137Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6901526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6901762Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6902128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6902227Z kernel = self.compile( 2025-05-07T20:32:42.6902642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6902824Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6902965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6902970Z 2025-05-07T20:32:42.6903181Z self = 2025-05-07T20:32:42.6904031Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6904624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4151700>} 2025-05-07T20:32:42.6905445Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6905682Z context = 2025-05-07T20:32:42.6905687Z 2025-05-07T20:32:42.6905864Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6906153Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6906263Z module_map=module_map) 2025-05-07T20:32:42.6906436Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6906561Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6906655Z E ^ 2025-05-07T20:32:42.6907064Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6907069Z 2025-05-07T20:32:42.6907520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6907529Z 2025-05-07T20:32:42.6907630Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6907866Z self=, 2025-05-07T20:32:42.6907952Z T=2048, 2025-05-07T20:32:42.6908033Z D=7168, 2025-05-07T20:32:42.6908118Z scale_ub=None, 2025-05-07T20:32:42.6908207Z contiguous=False, 2025-05-07T20:32:42.6908295Z compiled=True, 2025-05-07T20:32:42.6908373Z ) 2025-05-07T20:32:42.6908598Z self = 2025-05-07T20:32:42.6908782Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6908789Z 2025-05-07T20:32:42.6908868Z @given( 2025-05-07T20:32:42.6908990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6909098Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6909216Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6909344Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6909465Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6909542Z ) 2025-05-07T20:32:42.6909969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6910069Z def test_silu_mul_quant( 2025-05-07T20:32:42.6910150Z self, 2025-05-07T20:32:42.6910234Z T: int, 2025-05-07T20:32:42.6910312Z D: int, 2025-05-07T20:32:42.6910413Z scale_ub: Optional[float], 2025-05-07T20:32:42.6910507Z contiguous: bool, 2025-05-07T20:32:42.6910594Z compiled: bool, 2025-05-07T20:32:42.6910679Z ) -> None: 2025-05-07T20:32:42.6910777Z torch.manual_seed(2025) 2025-05-07T20:32:42.6910852Z 2025-05-07T20:32:42.6911036Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6911115Z 2025-05-07T20:32:42.6911210Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6911342Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6911436Z x = x_sign * x_clamp 2025-05-07T20:32:42.6911519Z x0 = x[:, :D] 2025-05-07T20:32:42.6911606Z x1 = x[:, D:] 2025-05-07T20:32:42.6911679Z 2025-05-07T20:32:42.6911768Z if contiguous: 2025-05-07T20:32:42.6911866Z x0 = x0.contiguous() 2025-05-07T20:32:42.6911958Z x1 = x1.contiguous() 2025-05-07T20:32:42.6912036Z 2025-05-07T20:32:42.6912131Z if scale_ub is not None: 2025-05-07T20:32:42.6912241Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6912379Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6912501Z ) 2025-05-07T20:32:42.6912575Z else: 2025-05-07T20:32:42.6912670Z scale_ub_tensor = None 2025-05-07T20:32:42.6912743Z 2025-05-07T20:32:42.6912872Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6912967Z op = silu_mul_quant 2025-05-07T20:32:42.6913050Z if compiled: 2025-05-07T20:32:42.6913218Z op = torch.compile(op) 2025-05-07T20:32:42.6913326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6913396Z 2025-05-07T20:32:42.6913490Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6913494Z 2025-05-07T20:32:42.6913593Z moe/activation_test.py:117: 2025-05-07T20:32:42.6913725Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6913831Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6913928Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6914320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6914417Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6914953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6915052Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6915445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6915685Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6916052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6916148Z kernel = self.compile( 2025-05-07T20:32:42.6916561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6916744Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6916882Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6916886Z 2025-05-07T20:32:42.6917109Z self = 2025-05-07T20:32:42.6917957Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6918589Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd41de3a0>} 2025-05-07T20:32:42.6919414Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6919613Z context = 2025-05-07T20:32:42.6919620Z 2025-05-07T20:32:42.6919794Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6920072Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6920182Z module_map=module_map) 2025-05-07T20:32:42.6920349Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6920446Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6920526Z E ^ 2025-05-07T20:32:42.6920919Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6920924Z 2025-05-07T20:32:42.6921373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6921377Z 2025-05-07T20:32:42.6921487Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6921757Z self=, 2025-05-07T20:32:42.6921839Z T=4096, 2025-05-07T20:32:42.6921918Z D=7168, 2025-05-07T20:32:42.6922004Z scale_ub=None, 2025-05-07T20:32:42.6922095Z contiguous=False, 2025-05-07T20:32:42.6922182Z compiled=True, 2025-05-07T20:32:42.6922258Z ) 2025-05-07T20:32:42.6922527Z self = 2025-05-07T20:32:42.6922706Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6922710Z 2025-05-07T20:32:42.6922789Z @given( 2025-05-07T20:32:42.6922911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6923008Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6923122Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6923240Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6923355Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6923434Z ) 2025-05-07T20:32:42.6923693Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6923786Z def test_silu_mul_quant( 2025-05-07T20:32:42.6923864Z self, 2025-05-07T20:32:42.6923944Z T: int, 2025-05-07T20:32:42.6924024Z D: int, 2025-05-07T20:32:42.6924131Z scale_ub: Optional[float], 2025-05-07T20:32:42.6924220Z contiguous: bool, 2025-05-07T20:32:42.6924306Z compiled: bool, 2025-05-07T20:32:42.6924389Z ) -> None: 2025-05-07T20:32:42.6924490Z torch.manual_seed(2025) 2025-05-07T20:32:42.6924564Z 2025-05-07T20:32:42.6924741Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6924819Z 2025-05-07T20:32:42.6924913Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6925042Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6925132Z x = x_sign * x_clamp 2025-05-07T20:32:42.6925219Z x0 = x[:, :D] 2025-05-07T20:32:42.6925304Z x1 = x[:, D:] 2025-05-07T20:32:42.6925376Z 2025-05-07T20:32:42.6925468Z if contiguous: 2025-05-07T20:32:42.6925562Z x0 = x0.contiguous() 2025-05-07T20:32:42.6925653Z x1 = x1.contiguous() 2025-05-07T20:32:42.6925735Z 2025-05-07T20:32:42.6925831Z if scale_ub is not None: 2025-05-07T20:32:42.6925942Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6926087Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6926163Z ) 2025-05-07T20:32:42.6926323Z else: 2025-05-07T20:32:42.6926426Z scale_ub_tensor = None 2025-05-07T20:32:42.6926499Z 2025-05-07T20:32:42.6926632Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6926722Z op = silu_mul_quant 2025-05-07T20:32:42.6926806Z if compiled: 2025-05-07T20:32:42.6926912Z op = torch.compile(op) 2025-05-07T20:32:42.6927020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6927096Z 2025-05-07T20:32:42.6927189Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6927193Z 2025-05-07T20:32:42.6927292Z moe/activation_test.py:117: 2025-05-07T20:32:42.6927425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6927530Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6927633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6928034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6928129Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6928667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6928765Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6929150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6929423Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6929788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6929880Z kernel = self.compile( 2025-05-07T20:32:42.6930290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6930509Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6930643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6930647Z 2025-05-07T20:32:42.6930863Z self = 2025-05-07T20:32:42.6931712Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6932269Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd41de700>} 2025-05-07T20:32:42.6933081Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6933285Z context = 2025-05-07T20:32:42.6933293Z 2025-05-07T20:32:42.6933462Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6933740Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6933853Z module_map=module_map) 2025-05-07T20:32:42.6934019Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6934122Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6934204Z E ^ 2025-05-07T20:32:42.6934588Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6934592Z 2025-05-07T20:32:42.6935045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6935053Z 2025-05-07T20:32:42.6935158Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6935472Z self=, 2025-05-07T20:32:42.6935552Z T=16384, 2025-05-07T20:32:42.6935628Z D=5120, 2025-05-07T20:32:42.6935715Z scale_ub=1200.0, 2025-05-07T20:32:42.6935806Z contiguous=False, 2025-05-07T20:32:42.6935895Z compiled=False, 2025-05-07T20:32:42.6935970Z ) 2025-05-07T20:32:42.6936202Z self = 2025-05-07T20:32:42.6936393Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6936398Z 2025-05-07T20:32:42.6936479Z @given( 2025-05-07T20:32:42.6936603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6936701Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6936822Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6936944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6937059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6937141Z ) 2025-05-07T20:32:42.6937408Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6937507Z def test_silu_mul_quant( 2025-05-07T20:32:42.6937583Z self, 2025-05-07T20:32:42.6937662Z T: int, 2025-05-07T20:32:42.6937744Z D: int, 2025-05-07T20:32:42.6937845Z scale_ub: Optional[float], 2025-05-07T20:32:42.6937936Z contiguous: bool, 2025-05-07T20:32:42.6938065Z compiled: bool, 2025-05-07T20:32:42.6938145Z ) -> None: 2025-05-07T20:32:42.6938242Z torch.manual_seed(2025) 2025-05-07T20:32:42.6938323Z 2025-05-07T20:32:42.6938499Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6938578Z 2025-05-07T20:32:42.6938676Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6938840Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6938934Z x = x_sign * x_clamp 2025-05-07T20:32:42.6939012Z x0 = x[:, :D] 2025-05-07T20:32:42.6939097Z x1 = x[:, D:] 2025-05-07T20:32:42.6939174Z 2025-05-07T20:32:42.6939258Z if contiguous: 2025-05-07T20:32:42.6939352Z x0 = x0.contiguous() 2025-05-07T20:32:42.6939444Z x1 = x1.contiguous() 2025-05-07T20:32:42.6939519Z 2025-05-07T20:32:42.6939609Z if scale_ub is not None: 2025-05-07T20:32:42.6939715Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6939851Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6939927Z ) 2025-05-07T20:32:42.6940006Z else: 2025-05-07T20:32:42.6940102Z scale_ub_tensor = None 2025-05-07T20:32:42.6940179Z 2025-05-07T20:32:42.6940315Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6940408Z op = silu_mul_quant 2025-05-07T20:32:42.6940499Z if compiled: 2025-05-07T20:32:42.6940602Z op = torch.compile(op) 2025-05-07T20:32:42.6940710Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6940792Z 2025-05-07T20:32:42.6940885Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6940889Z 2025-05-07T20:32:42.6940988Z moe/activation_test.py:117: 2025-05-07T20:32:42.6941131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6941235Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6941336Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6941883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.6941985Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6942377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6942617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6943063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6943161Z kernel = self.compile( 2025-05-07T20:32:42.6943573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6943755Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6943885Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6943892Z 2025-05-07T20:32:42.6944103Z self = 2025-05-07T20:32:42.6944955Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6945506Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3e83790>} 2025-05-07T20:32:42.6946325Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6946521Z context = 2025-05-07T20:32:42.6946526Z 2025-05-07T20:32:42.6946694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6947038Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6947147Z module_map=module_map) 2025-05-07T20:32:42.6947311Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6947447Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6947527Z E ^ 2025-05-07T20:32:42.6947917Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6947926Z 2025-05-07T20:32:42.6948374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6948379Z 2025-05-07T20:32:42.6948485Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6948719Z self=, 2025-05-07T20:32:42.6948801Z T=16384, 2025-05-07T20:32:42.6948888Z D=5120, 2025-05-07T20:32:42.6948973Z scale_ub=1200.0, 2025-05-07T20:32:42.6949059Z contiguous=True, 2025-05-07T20:32:42.6949146Z compiled=True, 2025-05-07T20:32:42.6949219Z ) 2025-05-07T20:32:42.6949449Z self = 2025-05-07T20:32:42.6949636Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.6949644Z 2025-05-07T20:32:42.6949770Z @given( 2025-05-07T20:32:42.6949895Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6950003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6950121Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6950242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6950356Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6950435Z ) 2025-05-07T20:32:42.6950701Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6950798Z def test_silu_mul_quant( 2025-05-07T20:32:42.6950874Z self, 2025-05-07T20:32:42.6950957Z T: int, 2025-05-07T20:32:42.6951034Z D: int, 2025-05-07T20:32:42.6951134Z scale_ub: Optional[float], 2025-05-07T20:32:42.6951228Z contiguous: bool, 2025-05-07T20:32:42.6951315Z compiled: bool, 2025-05-07T20:32:42.6951399Z ) -> None: 2025-05-07T20:32:42.6951495Z torch.manual_seed(2025) 2025-05-07T20:32:42.6951573Z 2025-05-07T20:32:42.6951833Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6951911Z 2025-05-07T20:32:42.6952000Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6952130Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6952221Z x = x_sign * x_clamp 2025-05-07T20:32:42.6952303Z x0 = x[:, :D] 2025-05-07T20:32:42.6952391Z x1 = x[:, D:] 2025-05-07T20:32:42.6952467Z 2025-05-07T20:32:42.6952554Z if contiguous: 2025-05-07T20:32:42.6952655Z x0 = x0.contiguous() 2025-05-07T20:32:42.6952749Z x1 = x1.contiguous() 2025-05-07T20:32:42.6952828Z 2025-05-07T20:32:42.6952924Z if scale_ub is not None: 2025-05-07T20:32:42.6953032Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6953175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6953256Z ) 2025-05-07T20:32:42.6953333Z else: 2025-05-07T20:32:42.6953430Z scale_ub_tensor = None 2025-05-07T20:32:42.6953517Z 2025-05-07T20:32:42.6953657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6953747Z op = silu_mul_quant 2025-05-07T20:32:42.6953839Z if compiled: 2025-05-07T20:32:42.6953945Z op = torch.compile(op) 2025-05-07T20:32:42.6954056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6954129Z 2025-05-07T20:32:42.6954222Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6954268Z 2025-05-07T20:32:42.6954369Z moe/activation_test.py:117: 2025-05-07T20:32:42.6954501Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6954604Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6954708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6955100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6955232Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6955773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6955871Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6956254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6956487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6956854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6956951Z kernel = self.compile( 2025-05-07T20:32:42.6957363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6957547Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6957685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6957689Z 2025-05-07T20:32:42.6957907Z self = 2025-05-07T20:32:42.6958757Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6959311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3d91550>} 2025-05-07T20:32:42.6960133Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6960334Z context = 2025-05-07T20:32:42.6960338Z 2025-05-07T20:32:42.6960510Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6960874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6960981Z module_map=module_map) 2025-05-07T20:32:42.6961145Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6961243Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6961321Z E ^ 2025-05-07T20:32:42.6961710Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6961717Z 2025-05-07T20:32:42.6962166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6962170Z 2025-05-07T20:32:42.6962278Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6962515Z self=, 2025-05-07T20:32:42.6962597Z T=16384, 2025-05-07T20:32:42.6962680Z D=5120, 2025-05-07T20:32:42.6962770Z scale_ub=None, 2025-05-07T20:32:42.6962858Z contiguous=False, 2025-05-07T20:32:42.6962945Z compiled=True, 2025-05-07T20:32:42.6963021Z ) 2025-05-07T20:32:42.6963250Z self = 2025-05-07T20:32:42.6963439Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6963443Z 2025-05-07T20:32:42.6963561Z @given( 2025-05-07T20:32:42.6963681Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6963779Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6963894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6964013Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6964125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6964242Z ) 2025-05-07T20:32:42.6964507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6964605Z def test_silu_mul_quant( 2025-05-07T20:32:42.6964684Z self, 2025-05-07T20:32:42.6964765Z T: int, 2025-05-07T20:32:42.6964844Z D: int, 2025-05-07T20:32:42.6964945Z scale_ub: Optional[float], 2025-05-07T20:32:42.6965042Z contiguous: bool, 2025-05-07T20:32:42.6965130Z compiled: bool, 2025-05-07T20:32:42.6965212Z ) -> None: 2025-05-07T20:32:42.6965309Z torch.manual_seed(2025) 2025-05-07T20:32:42.6965389Z 2025-05-07T20:32:42.6965568Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6965647Z 2025-05-07T20:32:42.6965742Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6965871Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6965962Z x = x_sign * x_clamp 2025-05-07T20:32:42.6966047Z x0 = x[:, :D] 2025-05-07T20:32:42.6966132Z x1 = x[:, D:] 2025-05-07T20:32:42.6966203Z 2025-05-07T20:32:42.6966288Z if contiguous: 2025-05-07T20:32:42.6966388Z x0 = x0.contiguous() 2025-05-07T20:32:42.6966480Z x1 = x1.contiguous() 2025-05-07T20:32:42.6966560Z 2025-05-07T20:32:42.6966657Z if scale_ub is not None: 2025-05-07T20:32:42.6966764Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6966904Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6966982Z ) 2025-05-07T20:32:42.6967058Z else: 2025-05-07T20:32:42.6967157Z scale_ub_tensor = None 2025-05-07T20:32:42.6967233Z 2025-05-07T20:32:42.6967364Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6967459Z op = silu_mul_quant 2025-05-07T20:32:42.6967545Z if compiled: 2025-05-07T20:32:42.6967647Z op = torch.compile(op) 2025-05-07T20:32:42.6967762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6967838Z 2025-05-07T20:32:42.6967931Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6967938Z 2025-05-07T20:32:42.6968120Z moe/activation_test.py:117: 2025-05-07T20:32:42.6968256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6968357Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6968456Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6968849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6968950Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6969485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6969590Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6969975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6970215Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6970588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6970683Z kernel = self.compile( 2025-05-07T20:32:42.6971095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6971282Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6971415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6971460Z 2025-05-07T20:32:42.6971679Z self = 2025-05-07T20:32:42.6972525Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6973119Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3d910d0>} 2025-05-07T20:32:42.6973937Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6974135Z context = 2025-05-07T20:32:42.6974143Z 2025-05-07T20:32:42.6974314Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6974592Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6974699Z module_map=module_map) 2025-05-07T20:32:42.6974863Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6974965Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6975046Z E ^ 2025-05-07T20:32:42.6975432Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6975436Z 2025-05-07T20:32:42.6975881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6975886Z 2025-05-07T20:32:42.6975989Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6976220Z self=, 2025-05-07T20:32:42.6976304Z T=2048, 2025-05-07T20:32:42.6976385Z D=5120, 2025-05-07T20:32:42.6976470Z scale_ub=None, 2025-05-07T20:32:42.6976560Z contiguous=False, 2025-05-07T20:32:42.6976646Z compiled=True, 2025-05-07T20:32:42.6976723Z ) 2025-05-07T20:32:42.6976956Z self = 2025-05-07T20:32:42.6977140Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6977144Z 2025-05-07T20:32:42.6977225Z @given( 2025-05-07T20:32:42.6977460Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6977561Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6977680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6977794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6977906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6977985Z ) 2025-05-07T20:32:42.6978242Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6978335Z def test_silu_mul_quant( 2025-05-07T20:32:42.6978413Z self, 2025-05-07T20:32:42.6978492Z T: int, 2025-05-07T20:32:42.6978572Z D: int, 2025-05-07T20:32:42.6978676Z scale_ub: Optional[float], 2025-05-07T20:32:42.6978767Z contiguous: bool, 2025-05-07T20:32:42.6978858Z compiled: bool, 2025-05-07T20:32:42.6978943Z ) -> None: 2025-05-07T20:32:42.6979039Z torch.manual_seed(2025) 2025-05-07T20:32:42.6979119Z 2025-05-07T20:32:42.6979299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6979379Z 2025-05-07T20:32:42.6979476Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6979603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6979696Z x = x_sign * x_clamp 2025-05-07T20:32:42.6979781Z x0 = x[:, :D] 2025-05-07T20:32:42.6979866Z x1 = x[:, D:] 2025-05-07T20:32:42.6979982Z 2025-05-07T20:32:42.6980068Z if contiguous: 2025-05-07T20:32:42.6980161Z x0 = x0.contiguous() 2025-05-07T20:32:42.6980249Z x1 = x1.contiguous() 2025-05-07T20:32:42.6980326Z 2025-05-07T20:32:42.6980420Z if scale_ub is not None: 2025-05-07T20:32:42.6980526Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6980706Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6980779Z ) 2025-05-07T20:32:42.6980858Z else: 2025-05-07T20:32:42.6980961Z scale_ub_tensor = None 2025-05-07T20:32:42.6981033Z 2025-05-07T20:32:42.6981167Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6981258Z op = silu_mul_quant 2025-05-07T20:32:42.6981345Z if compiled: 2025-05-07T20:32:42.6981451Z op = torch.compile(op) 2025-05-07T20:32:42.6981559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6981631Z 2025-05-07T20:32:42.6981729Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6981734Z 2025-05-07T20:32:42.6981834Z moe/activation_test.py:117: 2025-05-07T20:32:42.6981972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6982077Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6982179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6982576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6982667Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6983485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6983595Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6983983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6984220Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6984587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6984679Z kernel = self.compile( 2025-05-07T20:32:42.6985093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6985276Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6985408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6985558Z 2025-05-07T20:32:42.6985794Z self = 2025-05-07T20:32:42.6986784Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6987424Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3d14af0>} 2025-05-07T20:32:42.6988366Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6988588Z context = 2025-05-07T20:32:42.6988593Z 2025-05-07T20:32:42.6988783Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6989095Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6989210Z module_map=module_map) 2025-05-07T20:32:42.6989389Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6989495Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6989633Z E ^ 2025-05-07T20:32:42.6990081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6990086Z 2025-05-07T20:32:42.6990536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6990602Z 2025-05-07T20:32:42.6990706Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6990939Z self=, 2025-05-07T20:32:42.6991024Z T=2048, 2025-05-07T20:32:42.6991109Z D=5120, 2025-05-07T20:32:42.6991198Z scale_ub=1200.0, 2025-05-07T20:32:42.6991291Z contiguous=False, 2025-05-07T20:32:42.6991380Z compiled=True, 2025-05-07T20:32:42.6991468Z ) 2025-05-07T20:32:42.6991697Z self = 2025-05-07T20:32:42.6991883Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.6991890Z 2025-05-07T20:32:42.6991970Z @given( 2025-05-07T20:32:42.6992092Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6992193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6992315Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6992436Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6992559Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6992639Z ) 2025-05-07T20:32:42.6992903Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6993007Z def test_silu_mul_quant( 2025-05-07T20:32:42.6993088Z self, 2025-05-07T20:32:42.6993169Z T: int, 2025-05-07T20:32:42.6993251Z D: int, 2025-05-07T20:32:42.6993354Z scale_ub: Optional[float], 2025-05-07T20:32:42.6993446Z contiguous: bool, 2025-05-07T20:32:42.6993538Z compiled: bool, 2025-05-07T20:32:42.6993619Z ) -> None: 2025-05-07T20:32:42.6993720Z torch.manual_seed(2025) 2025-05-07T20:32:42.6993800Z 2025-05-07T20:32:42.6993975Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6994056Z 2025-05-07T20:32:42.6994151Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6994278Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6994377Z x = x_sign * x_clamp 2025-05-07T20:32:42.6994460Z x0 = x[:, :D] 2025-05-07T20:32:42.6994544Z x1 = x[:, D:] 2025-05-07T20:32:42.6994622Z 2025-05-07T20:32:42.6994793Z if contiguous: 2025-05-07T20:32:42.6994885Z x0 = x0.contiguous() 2025-05-07T20:32:42.6994977Z x1 = x1.contiguous() 2025-05-07T20:32:42.6995048Z 2025-05-07T20:32:42.6995135Z if scale_ub is not None: 2025-05-07T20:32:42.6995247Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6995385Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6995466Z ) 2025-05-07T20:32:42.6995550Z else: 2025-05-07T20:32:42.6995648Z scale_ub_tensor = None 2025-05-07T20:32:42.6995727Z 2025-05-07T20:32:42.6995859Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6995952Z op = silu_mul_quant 2025-05-07T20:32:42.6996040Z if compiled: 2025-05-07T20:32:42.6996146Z op = torch.compile(op) 2025-05-07T20:32:42.6996254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6996341Z 2025-05-07T20:32:42.6996436Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6996445Z 2025-05-07T20:32:42.6996543Z moe/activation_test.py:117: 2025-05-07T20:32:42.6996682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6996786Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6996890Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6997283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6997419Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6997961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6998062Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6998483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6998720Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6999091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6999190Z kernel = self.compile( 2025-05-07T20:32:42.6999603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6999785Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6999925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6999929Z 2025-05-07T20:32:42.7000143Z self = 2025-05-07T20:32:42.7000998Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7001563Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3adf820>} 2025-05-07T20:32:42.7002382Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7002585Z context = 2025-05-07T20:32:42.7002592Z 2025-05-07T20:32:42.7002761Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7003041Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7003148Z module_map=module_map) 2025-05-07T20:32:42.7003315Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7003416Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7003495Z E ^ 2025-05-07T20:32:42.7003962Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7003968Z 2025-05-07T20:32:42.7004416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7004420Z 2025-05-07T20:32:42.7004522Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7004757Z self=, 2025-05-07T20:32:42.7004839Z T=4096, 2025-05-07T20:32:42.7004923Z D=5120, 2025-05-07T20:32:42.7005012Z scale_ub=1200.0, 2025-05-07T20:32:42.7005101Z contiguous=True, 2025-05-07T20:32:42.7005190Z compiled=True, 2025-05-07T20:32:42.7005265Z ) 2025-05-07T20:32:42.7005495Z self = 2025-05-07T20:32:42.7005682Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.7005686Z 2025-05-07T20:32:42.7005775Z @given( 2025-05-07T20:32:42.7005895Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7006002Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7006119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7006240Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7006360Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7006484Z ) 2025-05-07T20:32:42.7006748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7006846Z def test_silu_mul_quant( 2025-05-07T20:32:42.7006925Z self, 2025-05-07T20:32:42.7007009Z T: int, 2025-05-07T20:32:42.7007090Z D: int, 2025-05-07T20:32:42.7007190Z scale_ub: Optional[float], 2025-05-07T20:32:42.7007321Z contiguous: bool, 2025-05-07T20:32:42.7007407Z compiled: bool, 2025-05-07T20:32:42.7007485Z ) -> None: 2025-05-07T20:32:42.7007583Z torch.manual_seed(2025) 2025-05-07T20:32:42.7007663Z 2025-05-07T20:32:42.7007839Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7007923Z 2025-05-07T20:32:42.7008020Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7008154Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7008247Z x = x_sign * x_clamp 2025-05-07T20:32:42.7008332Z x0 = x[:, :D] 2025-05-07T20:32:42.7008422Z x1 = x[:, D:] 2025-05-07T20:32:42.7008499Z 2025-05-07T20:32:42.7008587Z if contiguous: 2025-05-07T20:32:42.7008684Z x0 = x0.contiguous() 2025-05-07T20:32:42.7008776Z x1 = x1.contiguous() 2025-05-07T20:32:42.7008855Z 2025-05-07T20:32:42.7008951Z if scale_ub is not None: 2025-05-07T20:32:42.7009063Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7009203Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7009288Z ) 2025-05-07T20:32:42.7009368Z else: 2025-05-07T20:32:42.7009468Z scale_ub_tensor = None 2025-05-07T20:32:42.7009549Z 2025-05-07T20:32:42.7009682Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7009779Z op = silu_mul_quant 2025-05-07T20:32:42.7009867Z if compiled: 2025-05-07T20:32:42.7009971Z op = torch.compile(op) 2025-05-07T20:32:42.7010083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7010166Z 2025-05-07T20:32:42.7010262Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7010266Z 2025-05-07T20:32:42.7010370Z moe/activation_test.py:117: 2025-05-07T20:32:42.7010505Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7010607Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7010714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7011108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.7011311Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.7011854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:42.7011953Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:42.7012347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:42.7012588Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:42.7012957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:42.7013053Z     kernel = self.compile(
2025-05-07T20:32:42.7013466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:42.7013655Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:42.7013794Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.7013799Z 
2025-05-07T20:32:42.7014016Z self = 
2025-05-07T20:32:42.7014871Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:42.7015463Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3bf2430>}
2025-05-07T20:32:42.7016281Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:42.7016518Z context = 
2025-05-07T20:32:42.7016522Z 
2025-05-07T20:32:42.7016698Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:42.7016973Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:42.7017086Z                           module_map=module_map)
2025-05-07T20:32:42.7017255Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:42.7017361Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:42.7017443Z E       ^
2025-05-07T20:32:42.7017831Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:42.7017836Z 
2025-05-07T20:32:42.7018285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:42.7018293Z 
2025-05-07T20:32:42.7018402Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:42.7018642Z     self=,
2025-05-07T20:32:42.7018723Z     T=128,
2025-05-07T20:32:42.7018808Z     D=5120,
2025-05-07T20:32:42.7018895Z     scale_ub=1200.0,
2025-05-07T20:32:42.7018987Z     contiguous=False,
2025-05-07T20:32:42.7019079Z     compiled=True,
2025-05-07T20:32:42.7019157Z )
2025-05-07T20:32:42.7019390Z self = 
2025-05-07T20:32:42.7019574Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:32:42.7019582Z 
2025-05-07T20:32:42.7019662Z     @given(
2025-05-07T20:32:42.7019787Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:42.7023755Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:42.7023903Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:42.7024034Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:42.7024154Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:42.7024235Z     )
2025-05-07T20:32:42.7024603Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:42.7024706Z     def test_silu_mul_quant(
2025-05-07T20:32:42.7024786Z         self,
2025-05-07T20:32:42.7024866Z         T: int,
2025-05-07T20:32:42.7024948Z         D: int,
2025-05-07T20:32:42.7025052Z         scale_ub: Optional[float],
2025-05-07T20:32:42.7025152Z         contiguous: bool,
2025-05-07T20:32:42.7025245Z         compiled: bool,
2025-05-07T20:32:42.7025335Z     ) -> None:
2025-05-07T20:32:42.7025434Z         torch.manual_seed(2025)
2025-05-07T20:32:42.7025515Z 
2025-05-07T20:32:42.7025698Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:42.7025778Z 
2025-05-07T20:32:42.7025877Z         x_sign = torch.sign(x)
2025-05-07T20:32:42.7026009Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:42.7026104Z         x = x_sign * x_clamp
2025-05-07T20:32:42.7026196Z         x0 = x[:, :D]
2025-05-07T20:32:42.7026278Z         x1 = x[:, D:]
2025-05-07T20:32:42.7026371Z 
2025-05-07T20:32:42.7026469Z         if contiguous:
2025-05-07T20:32:42.7026580Z             x0 = x0.contiguous()
2025-05-07T20:32:42.7026690Z             x1 = x1.contiguous()
2025-05-07T20:32:42.7026776Z 
2025-05-07T20:32:42.7026869Z         if scale_ub is not None:
2025-05-07T20:32:42.7026984Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:42.7027127Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:42.7027253Z             )
2025-05-07T20:32:42.7027335Z         else:
2025-05-07T20:32:42.7027431Z             scale_ub_tensor = None
2025-05-07T20:32:42.7027507Z 
2025-05-07T20:32:42.7027643Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:42.7027736Z             op = silu_mul_quant
2025-05-07T20:32:42.7027867Z             if compiled:
2025-05-07T20:32:42.7027968Z                 op = torch.compile(op)
2025-05-07T20:32:42.7028075Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.7028151Z 
2025-05-07T20:32:42.7028249Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:42.7028254Z 
2025-05-07T20:32:42.7028353Z moe/activation_test.py:117:
2025-05-07T20:32:42.7028493Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.7028594Z moe/activation_test.py:115: in fn
2025-05-07T20:32:42.7028697Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.7029105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:42.7029208Z     return fn(*args, **kwargs)
2025-05-07T20:32:42.7029845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:42.7029948Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:42.7030348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:42.7030595Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:42.7030966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:42.7031063Z     kernel = self.compile(
2025-05-07T20:32:42.7031482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:42.7031668Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:42.7031808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.7031813Z 
2025-05-07T20:32:42.7032032Z self = 
2025-05-07T20:32:42.7032890Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:42.7033536Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3a36040>}
2025-05-07T20:32:42.7034357Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:42.7034571Z context = 
2025-05-07T20:32:42.7034575Z 
2025-05-07T20:32:42.7034751Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:42.7035040Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:42.7035154Z                           module_map=module_map)
2025-05-07T20:32:42.7035324Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:42.7035429Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:42.7035513Z E       ^
2025-05-07T20:32:42.7035900Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
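[editor's note] The root cause of this CompilationError is the hardware, not the test inputs: Triton's fp8e4nv type (PyTorch's torch.float8_e4m3fn) is only lowered on NVIDIA GPUs with compute capability 8.9 or newer, and the error's supported list ('fp8e4b15', 'fp8e5') is exactly what Triton offers on older parts, so the runner's GPU must be below SM 8.9. A minimal guard sketch follows; torch is assumed importable and the helper name supports_fp8e4nv is hypothetical, not FBGEMM API:

import torch
import unittest

def supports_fp8e4nv() -> bool:
    # fp8e4nv (= torch.float8_e4m3fn) needs compute capability >= 8.9;
    # earlier GPUs only expose Triton's fp8e5 / fp8e4b15 encodings,
    # matching the ValueError in the log above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# e.g. applied to the failing test:
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")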
2025-05-07T20:32:42.7036357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:42.7036469Z Trying example: test_silu_mul_quant( self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True )
2025-05-07T20:32:42.7050386Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False )
2025-05-07T20:32:42.7065081Z Trying example: test_silu_mul_quant( self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False )
2025-05-07T20:32:42.7078642Z Trying example: test_silu_mul_quant( self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True )
2025-05-07T20:32:42.7092897Z Trying example: test_silu_mul_quant( self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True )
2025-05-07T20:32:42.7106859Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True )
[each example above fails with the identical traceback and error shown for the first example: triton.compiler.errors.CompilationError / ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100; the repeated test-source listings and tracebacks are elided]
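[editor's note] Every drawn example fails at the same point, make_ir (the AST-to-TTIR step), before any tensor data is touched, so no combination of T, D, scale_ub, contiguous, or compiled can change the outcome; Hypothesis just replays the same compile failure. A sketch of failing fast at collection time instead, building on the capability check above (pytest assumed; the marker name requires_sm89 is hypothetical):

import pytest
import torch

requires_sm89 = pytest.mark.skipif(
    not (torch.cuda.is_available()
         and torch.cuda.get_device_capability() >= (8, 9)),
    reason="fp8e4nv (float8_e4m3fn) is unsupported below SM 8.9",
)

# @requires_sm89
# def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled): ...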
[repeated test-source listings elided below; each drawn example, its failing statement, and its error are kept]
2025-05-07T20:32:42.7120878Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False )
2025-05-07T20:32:42.7124561Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:42.7126564Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:42.7126698Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:42.7126811Z Trying example: test_silu_mul_quant( self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True )
2025-05-07T20:32:42.7130427Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:42.7132408Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:42.7132576Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:42.7132685Z Trying example: test_silu_mul_quant( self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False )
2025-05-07T20:32:42.7135920Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:42.7137905Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:42.7138114Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:42.7138227Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True )
2025-05-07T20:32:42.7144568Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:42.7146549Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:42.7146675Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:42.7146797Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False )
2025-05-07T20:32:42.7150474Z >       x_sign = torch.sign(x)
2025-05-07T20:32:42.7152435Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:42.7152652Z moe/activation_test.py:94: OutOfMemoryError
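[editor's note] The "Tried to allocate" sizes above are exactly one [T, 2*D] bfloat16 tensor at 2 bytes per element, and the test materializes several tensors of that size per example (randn, abs/clamp, sign, the product). With roughly 21.9 GiB already held by the process from earlier examples, even a 56 MiB request fails. A quick check of the arithmetic (plain Python, no GPU needed):

def bf16_mib(T: int, D: int) -> float:
    # One [T, 2*D] bfloat16 tensor: 2 bytes per element.
    return T * (2 * D) * 2 / 2**20

assert bf16_mib(16384, 7168) == 448.0  # the failed torch.randn above
assert bf16_mib(16384, 5120) == 320.0  # the failed torch.clamp
assert bf16_mib(4096, 7168) == 112.0
assert bf16_mib(2048, 7168) == 56.0

The message's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, addresses allocator fragmentation; it would not by itself stop memory from accumulating across examples.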
2025-05-07T20:32:42.7152801Z Trying example: test_silu_mul_quant( self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False )
2025-05-07T20:32:42.7170190Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False )
[both examples above fail with the same CompilationError as before: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100; repeated listings and tracebacks elided]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7183247Z 2025-05-07T20:32:42.7183699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7183704Z 2025-05-07T20:32:42.7183810Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7184042Z self=, 2025-05-07T20:32:42.7184122Z T=128, 2025-05-07T20:32:42.7184213Z D=7168, 2025-05-07T20:32:42.7184300Z scale_ub=None, 2025-05-07T20:32:42.7184396Z contiguous=True, 2025-05-07T20:32:42.7184485Z compiled=False, 2025-05-07T20:32:42.7184562Z ) 2025-05-07T20:32:42.7184795Z self = 2025-05-07T20:32:42.7184975Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7184980Z 2025-05-07T20:32:42.7185070Z @given( 2025-05-07T20:32:42.7185200Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7185304Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7185426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7185550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7185665Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7185745Z ) 2025-05-07T20:32:42.7186006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7186103Z def test_silu_mul_quant( 2025-05-07T20:32:42.7186188Z self, 2025-05-07T20:32:42.7186267Z T: int, 2025-05-07T20:32:42.7186347Z D: int, 2025-05-07T20:32:42.7186451Z scale_ub: Optional[float], 2025-05-07T20:32:42.7186565Z contiguous: bool, 2025-05-07T20:32:42.7186658Z compiled: bool, 2025-05-07T20:32:42.7186761Z ) -> None: 2025-05-07T20:32:42.7186858Z torch.manual_seed(2025) 2025-05-07T20:32:42.7186936Z 2025-05-07T20:32:42.7187114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7187192Z 2025-05-07T20:32:42.7187384Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7187513Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7187602Z x = x_sign * x_clamp 2025-05-07T20:32:42.7187686Z x0 = x[:, :D] 2025-05-07T20:32:42.7187768Z x1 = x[:, D:] 2025-05-07T20:32:42.7187840Z 2025-05-07T20:32:42.7187927Z if contiguous: 2025-05-07T20:32:42.7188021Z x0 = x0.contiguous() 2025-05-07T20:32:42.7188110Z x1 = x1.contiguous() 2025-05-07T20:32:42.7188187Z 2025-05-07T20:32:42.7188278Z if scale_ub is not None: 2025-05-07T20:32:42.7188381Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7188518Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7188590Z ) 2025-05-07T20:32:42.7188671Z else: 2025-05-07T20:32:42.7188770Z scale_ub_tensor = None 2025-05-07T20:32:42.7188847Z 2025-05-07T20:32:42.7188982Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7189077Z op = silu_mul_quant 2025-05-07T20:32:42.7189162Z if compiled: 2025-05-07T20:32:42.7189268Z op = torch.compile(op) 2025-05-07T20:32:42.7189374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7189449Z 2025-05-07T20:32:42.7189547Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7189622Z 2025-05-07T20:32:42.7189776Z moe/activation_test.py:117: 2025-05-07T20:32:42.7189968Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7190078Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7190180Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7190727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.7190885Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7191270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7191511Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7191879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7191978Z kernel = self.compile( 2025-05-07T20:32:42.7192398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7192582Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7192720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7192724Z 2025-05-07T20:32:42.7192941Z self = 2025-05-07T20:32:42.7193794Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7194343Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd36f0550>} 2025-05-07T20:32:42.7195160Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7195364Z context = 2025-05-07T20:32:42.7195368Z 2025-05-07T20:32:42.7195541Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7195827Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7195939Z module_map=module_map) 2025-05-07T20:32:42.7196106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7196253Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7196333Z E ^ 2025-05-07T20:32:42.7196715Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7196720Z 2025-05-07T20:32:42.7197174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7197181Z 2025-05-07T20:32:42.7197282Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7197518Z self=, 2025-05-07T20:32:42.7197599Z T=2048, 2025-05-07T20:32:42.7197678Z D=7168, 2025-05-07T20:32:42.7197767Z scale_ub=1200.0, 2025-05-07T20:32:42.7197855Z contiguous=True, 2025-05-07T20:32:42.7197942Z compiled=False, 2025-05-07T20:32:42.7198023Z ) 2025-05-07T20:32:42.7198254Z self = 2025-05-07T20:32:42.7198442Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.7198451Z 2025-05-07T20:32:42.7198532Z @given( 2025-05-07T20:32:42.7198654Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7198760Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7198926Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7199047Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7199203Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7199277Z ) 2025-05-07T20:32:42.7199536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7199636Z def test_silu_mul_quant( 2025-05-07T20:32:42.7199716Z self, 2025-05-07T20:32:42.7199837Z T: int, 2025-05-07T20:32:42.7199915Z D: int, 2025-05-07T20:32:42.7200013Z scale_ub: Optional[float], 2025-05-07T20:32:42.7200105Z contiguous: bool, 2025-05-07T20:32:42.7200196Z compiled: bool, 2025-05-07T20:32:42.7200273Z ) -> None: 2025-05-07T20:32:42.7200370Z torch.manual_seed(2025) 2025-05-07T20:32:42.7200444Z 2025-05-07T20:32:42.7200618Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7202599Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
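The repeated CompilationError above is Triton rejecting the fp8e4nv element type on this runner's GPU: fp8e4nv (float8_e4m3fn) generally requires SM 8.9 (Ada) or newer, while the A10G in a g5.4xlarge reports SM 8.6, where only fp8e4b15 and fp8e5 are available. Below is a minimal sketch of a capability guard that would skip such tests on unsupported hardware; the (8, 9) threshold and the helper name are assumptions, not part of activation_test.py.

    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton fp8e4nv kernels are assumed to need
        # compute capability >= (8, 9); the A10G in this log reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test shown in this log:
    # @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...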
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7202610Z 2025-05-07T20:32:42.7202726Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7202731Z 2025-05-07T20:32:42.7202840Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7203072Z self=, 2025-05-07T20:32:42.7203156Z T=1, 2025-05-07T20:32:42.7203237Z D=5120, 2025-05-07T20:32:42.7203324Z scale_ub=1200.0, 2025-05-07T20:32:42.7203417Z contiguous=True, 2025-05-07T20:32:42.7203505Z compiled=False, 2025-05-07T20:32:42.7203584Z ) 2025-05-07T20:32:42.7203819Z self = 2025-05-07T20:32:42.7203991Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.7203995Z 2025-05-07T20:32:42.7204074Z @given( 2025-05-07T20:32:42.7204197Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7204303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7204422Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7204543Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7204705Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7204787Z ) 2025-05-07T20:32:42.7205046Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7205137Z def test_silu_mul_quant( 2025-05-07T20:32:42.7205220Z self, 2025-05-07T20:32:42.7205305Z T: int, 2025-05-07T20:32:42.7205386Z D: int, 2025-05-07T20:32:42.7205496Z scale_ub: Optional[float], 2025-05-07T20:32:42.7205586Z contiguous: bool, 2025-05-07T20:32:42.7205672Z compiled: bool, 2025-05-07T20:32:42.7205758Z ) -> None: 2025-05-07T20:32:42.7205857Z torch.manual_seed(2025) 2025-05-07T20:32:42.7205933Z 2025-05-07T20:32:42.7206111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7206196Z 2025-05-07T20:32:42.7206294Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7206420Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7206516Z x = x_sign * x_clamp 2025-05-07T20:32:42.7206607Z x0 = x[:, :D] 2025-05-07T20:32:42.7206687Z x1 = x[:, D:] 2025-05-07T20:32:42.7206763Z 2025-05-07T20:32:42.7206854Z if contiguous: 2025-05-07T20:32:42.7206947Z x0 = x0.contiguous() 2025-05-07T20:32:42.7207037Z x1 = x1.contiguous() 2025-05-07T20:32:42.7207161Z 2025-05-07T20:32:42.7207252Z if scale_ub is not None: 2025-05-07T20:32:42.7207394Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7207536Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7207614Z ) 2025-05-07T20:32:42.7207693Z else: 2025-05-07T20:32:42.7207789Z scale_ub_tensor = None 2025-05-07T20:32:42.7207867Z 2025-05-07T20:32:42.7208040Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7208130Z op = silu_mul_quant 2025-05-07T20:32:42.7208214Z if compiled: 2025-05-07T20:32:42.7208320Z op = torch.compile(op) 2025-05-07T20:32:42.7208426Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7208496Z 2025-05-07T20:32:42.7208593Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7208598Z 2025-05-07T20:32:42.7208693Z moe/activation_test.py:117: 2025-05-07T20:32:42.7208827Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7208933Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7209037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7209586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.7209685Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7210075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7210319Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7210691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7210792Z kernel = self.compile( 2025-05-07T20:32:42.7211208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7211392Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7211531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7211535Z 2025-05-07T20:32:42.7211750Z self = 2025-05-07T20:32:42.7212604Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7213201Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd37d5280>} 2025-05-07T20:32:42.7214019Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7214218Z context = 2025-05-07T20:32:42.7214225Z 2025-05-07T20:32:42.7214399Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7214680Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7214790Z module_map=module_map) 2025-05-07T20:32:42.7214959Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7215064Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7215140Z E ^ 2025-05-07T20:32:42.7215526Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7215536Z 2025-05-07T20:32:42.7215981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7215985Z 2025-05-07T20:32:42.7216155Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7216390Z self=, 2025-05-07T20:32:42.7216509Z T=2048, 2025-05-07T20:32:42.7216586Z D=5120, 2025-05-07T20:32:42.7216671Z scale_ub=None, 2025-05-07T20:32:42.7216756Z contiguous=True, 2025-05-07T20:32:42.7216839Z compiled=False, 2025-05-07T20:32:42.7216918Z ) 2025-05-07T20:32:42.7217145Z self = 2025-05-07T20:32:42.7217374Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7217378Z 2025-05-07T20:32:42.7217452Z @given( 2025-05-07T20:32:42.7217574Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7217677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7217794Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7217911Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7218031Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7218106Z ) 2025-05-07T20:32:42.7218364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7218466Z def test_silu_mul_quant( 2025-05-07T20:32:42.7218545Z self, 2025-05-07T20:32:42.7218629Z T: int, 2025-05-07T20:32:42.7218711Z D: int, 2025-05-07T20:32:42.7218813Z scale_ub: Optional[float], 2025-05-07T20:32:42.7218918Z contiguous: bool, 2025-05-07T20:32:42.7219004Z compiled: bool, 2025-05-07T20:32:42.7219085Z ) -> None: 2025-05-07T20:32:42.7219188Z torch.manual_seed(2025) 2025-05-07T20:32:42.7219263Z 2025-05-07T20:32:42.7219439Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7219522Z 2025-05-07T20:32:42.7219615Z > x_sign = torch.sign(x) 2025-05-07T20:32:42.7221587Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7221599Z 2025-05-07T20:32:42.7221716Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:42.7221721Z 2025-05-07T20:32:42.7221820Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7222101Z self=, 2025-05-07T20:32:42.7222188Z T=16384, 2025-05-07T20:32:42.7222270Z D=5120, 2025-05-07T20:32:42.7222362Z scale_ub=None, 2025-05-07T20:32:42.7222448Z contiguous=True, 2025-05-07T20:32:42.7222531Z compiled=False, 2025-05-07T20:32:42.7222617Z ) 2025-05-07T20:32:42.7222847Z self = 2025-05-07T20:32:42.7223032Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7223037Z 2025-05-07T20:32:42.7223116Z @given( 2025-05-07T20:32:42.7223234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7223336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7223456Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7223573Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7223690Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7223764Z ) 2025-05-07T20:32:42.7224022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7224120Z def test_silu_mul_quant( 2025-05-07T20:32:42.7224197Z self, 2025-05-07T20:32:42.7224275Z T: int, 2025-05-07T20:32:42.7224408Z D: int, 2025-05-07T20:32:42.7224507Z scale_ub: Optional[float], 2025-05-07T20:32:42.7224638Z contiguous: bool, 2025-05-07T20:32:42.7224725Z compiled: bool, 2025-05-07T20:32:42.7224806Z ) -> None: 2025-05-07T20:32:42.7224906Z torch.manual_seed(2025) 2025-05-07T20:32:42.7224982Z 2025-05-07T20:32:42.7225158Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7227163Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7227169Z 2025-05-07T20:32:42.7227284Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7227291Z 2025-05-07T20:32:42.7227395Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7227626Z self=, 2025-05-07T20:32:42.7227704Z T=4096, 2025-05-07T20:32:42.7227785Z D=5120, 2025-05-07T20:32:42.7227870Z scale_ub=None, 2025-05-07T20:32:42.7227965Z contiguous=True, 2025-05-07T20:32:42.7228049Z compiled=False, 2025-05-07T20:32:42.7228126Z ) 2025-05-07T20:32:42.7228357Z self = 2025-05-07T20:32:42.7228537Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7228542Z 2025-05-07T20:32:42.7228621Z @given( 2025-05-07T20:32:42.7228746Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7228849Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7228967Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7229089Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7229205Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7229286Z ) 2025-05-07T20:32:42.7229545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7229640Z def test_silu_mul_quant( 2025-05-07T20:32:42.7229831Z self, 2025-05-07T20:32:42.7229914Z T: int, 2025-05-07T20:32:42.7229993Z D: int, 2025-05-07T20:32:42.7230096Z scale_ub: Optional[float], 2025-05-07T20:32:42.7230186Z contiguous: bool, 2025-05-07T20:32:42.7230319Z compiled: bool, 2025-05-07T20:32:42.7230399Z ) -> None: 2025-05-07T20:32:42.7230493Z torch.manual_seed(2025) 2025-05-07T20:32:42.7230564Z 2025-05-07T20:32:42.7230740Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7232686Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7232696Z 2025-05-07T20:32:42.7232819Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7232826Z 2025-05-07T20:32:42.7232931Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7233169Z self=, 2025-05-07T20:32:42.7233248Z T=2048, 2025-05-07T20:32:42.7233327Z D=5120, 2025-05-07T20:32:42.7233415Z scale_ub=None, 2025-05-07T20:32:42.7233554Z contiguous=False, 2025-05-07T20:32:42.7233640Z compiled=False, 2025-05-07T20:32:42.7233754Z ) 2025-05-07T20:32:42.7233979Z self = 2025-05-07T20:32:42.7234158Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.7234163Z 2025-05-07T20:32:42.7234242Z @given( 2025-05-07T20:32:42.7234363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7234506Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7234618Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7234733Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7234849Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7234925Z ) 2025-05-07T20:32:42.7235182Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7235279Z def test_silu_mul_quant( 2025-05-07T20:32:42.7235358Z self, 2025-05-07T20:32:42.7235440Z T: int, 2025-05-07T20:32:42.7235520Z D: int, 2025-05-07T20:32:42.7235622Z scale_ub: Optional[float], 2025-05-07T20:32:42.7235716Z contiguous: bool, 2025-05-07T20:32:42.7235802Z compiled: bool, 2025-05-07T20:32:42.7235883Z ) -> None: 2025-05-07T20:32:42.7235983Z torch.manual_seed(2025) 2025-05-07T20:32:42.7236061Z 2025-05-07T20:32:42.7236236Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7238192Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
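Every OOM above carries the allocator's own suggestion: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True when "reserved but unallocated" memory is large. A minimal sketch of applying it follows, assuming the variable is read when CUDA is first initialized, so it must be set before the process touches the GPU.

    import os

    # Must be set before the first CUDA allocation in the process,
    # e.g. at the top of conftest.py or exported in the job environment.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    # Dropping cached blocks between examples can also help when, as here,
    # a 22.07 GiB GPU is already 22.04 GiB full before the tensor is made.
    torch.cuda.empty_cache()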
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7238201Z 2025-05-07T20:32:42.7238316Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7238320Z 2025-05-07T20:32:42.7238425Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7238652Z self=, 2025-05-07T20:32:42.7238726Z T=4096, 2025-05-07T20:32:42.7238808Z D=7168, 2025-05-07T20:32:42.7238895Z scale_ub=None, 2025-05-07T20:32:42.7238983Z contiguous=True, 2025-05-07T20:32:42.7239074Z compiled=True, 2025-05-07T20:32:42.7239152Z ) 2025-05-07T20:32:42.7239430Z self = 2025-05-07T20:32:42.7239606Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.7239611Z 2025-05-07T20:32:42.7239686Z @given( 2025-05-07T20:32:42.7239810Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7239915Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7240034Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7240156Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7240270Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7240349Z ) 2025-05-07T20:32:42.7240612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7240707Z def test_silu_mul_quant( 2025-05-07T20:32:42.7240788Z self, 2025-05-07T20:32:42.7240868Z T: int, 2025-05-07T20:32:42.7240947Z D: int, 2025-05-07T20:32:42.7241052Z scale_ub: Optional[float], 2025-05-07T20:32:42.7241142Z contiguous: bool, 2025-05-07T20:32:42.7241231Z compiled: bool, 2025-05-07T20:32:42.7241318Z ) -> None: 2025-05-07T20:32:42.7241414Z torch.manual_seed(2025) 2025-05-07T20:32:42.7241490Z 2025-05-07T20:32:42.7241715Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7243666Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7243744Z 2025-05-07T20:32:42.7243876Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7243881Z 2025-05-07T20:32:42.7243989Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7244251Z self=, 2025-05-07T20:32:42.7244330Z T=2048, 2025-05-07T20:32:42.7244408Z D=5120, 2025-05-07T20:32:42.7244501Z scale_ub=1200.0, 2025-05-07T20:32:42.7244590Z contiguous=False, 2025-05-07T20:32:42.7244679Z compiled=False, 2025-05-07T20:32:42.7244761Z ) 2025-05-07T20:32:42.7245012Z self = 2025-05-07T20:32:42.7245213Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.7245217Z 2025-05-07T20:32:42.7245303Z @given( 2025-05-07T20:32:42.7245428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7245536Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7245662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7245785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7245906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7245984Z ) 2025-05-07T20:32:42.7246277Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7246383Z def test_silu_mul_quant( 2025-05-07T20:32:42.7246461Z self, 2025-05-07T20:32:42.7246542Z T: int, 2025-05-07T20:32:42.7246627Z D: int, 2025-05-07T20:32:42.7246730Z scale_ub: Optional[float], 2025-05-07T20:32:42.7246825Z contiguous: bool, 2025-05-07T20:32:42.7246914Z compiled: bool, 2025-05-07T20:32:42.7246995Z ) -> None: 2025-05-07T20:32:42.7247097Z torch.manual_seed(2025) 2025-05-07T20:32:42.7247177Z 2025-05-07T20:32:42.7247363Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7249745Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7249755Z 2025-05-07T20:32:42.7249874Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7249879Z 2025-05-07T20:32:42.7249981Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7250211Z self=, 2025-05-07T20:32:42.7250290Z T=4096, 2025-05-07T20:32:42.7250376Z D=7168, 2025-05-07T20:32:42.7250460Z scale_ub=1200.0, 2025-05-07T20:32:42.7250551Z contiguous=True, 2025-05-07T20:32:42.7250640Z compiled=False, 2025-05-07T20:32:42.7250716Z ) 2025-05-07T20:32:42.7250951Z self = 2025-05-07T20:32:42.7251132Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.7251137Z 2025-05-07T20:32:42.7251212Z @given( 2025-05-07T20:32:42.7251381Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7251517Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7251631Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7251748Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7251860Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7251935Z ) 2025-05-07T20:32:42.7252245Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7252340Z def test_silu_mul_quant( 2025-05-07T20:32:42.7252423Z self, 2025-05-07T20:32:42.7252501Z T: int, 2025-05-07T20:32:42.7252585Z D: int, 2025-05-07T20:32:42.7252690Z scale_ub: Optional[float], 2025-05-07T20:32:42.7252781Z contiguous: bool, 2025-05-07T20:32:42.7252867Z compiled: bool, 2025-05-07T20:32:42.7252949Z ) -> None: 2025-05-07T20:32:42.7253045Z torch.manual_seed(2025) 2025-05-07T20:32:42.7253124Z 2025-05-07T20:32:42.7253302Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7255258Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7255266Z 2025-05-07T20:32:42.7255389Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7255393Z 2025-05-07T20:32:42.7255495Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7255731Z self=, 2025-05-07T20:32:42.7255815Z T=16384, 2025-05-07T20:32:42.7255894Z D=7168, 2025-05-07T20:32:42.7255983Z scale_ub=None, 2025-05-07T20:32:42.7256069Z contiguous=False, 2025-05-07T20:32:42.7256154Z compiled=True, 2025-05-07T20:32:42.7256233Z ) 2025-05-07T20:32:42.7256460Z self = 2025-05-07T20:32:42.7256646Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.7256656Z 2025-05-07T20:32:42.7256735Z @given( 2025-05-07T20:32:42.7256855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7257003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7257119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7257237Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7257352Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7257426Z ) 2025-05-07T20:32:42.7257687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7257793Z def test_silu_mul_quant( 2025-05-07T20:32:42.7257872Z self, 2025-05-07T20:32:42.7257953Z T: int, 2025-05-07T20:32:42.7258034Z D: int, 2025-05-07T20:32:42.7258136Z scale_ub: Optional[float], 2025-05-07T20:32:42.7258229Z contiguous: bool, 2025-05-07T20:32:42.7258318Z compiled: bool, 2025-05-07T20:32:42.7258399Z ) -> None: 2025-05-07T20:32:42.7258499Z torch.manual_seed(2025) 2025-05-07T20:32:42.7258574Z 2025-05-07T20:32:42.7258748Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7260746Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
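The 448.00 MiB in the failed allocations for T=16384, D=7168 matches the input tensor the test builds, which confirms these OOMs happen in test setup (torch.randn) rather than in the kernel under test. A quick check, illustrative only:

    # x = torch.randn([T, 2 * D], dtype=torch.bfloat16) at T=16384, D=7168:
    T, D = 16384, 7168
    bytes_per_elem = 2  # bfloat16
    print(T * (2 * D) * bytes_per_elem / 1024**2)  # 448.0 (MiB)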
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7260787Z 2025-05-07T20:32:42.7260903Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7260907Z 2025-05-07T20:32:42.7261015Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7261283Z self=, 2025-05-07T20:32:42.7261357Z T=4096, 2025-05-07T20:32:42.7261440Z D=7168, 2025-05-07T20:32:42.7261528Z scale_ub=None, 2025-05-07T20:32:42.7261624Z contiguous=True, 2025-05-07T20:32:42.7261711Z compiled=False, 2025-05-07T20:32:42.7261788Z ) 2025-05-07T20:32:42.7262020Z self = 2025-05-07T20:32:42.7262196Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7262203Z 2025-05-07T20:32:42.7262283Z @given( 2025-05-07T20:32:42.7262412Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7262514Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7262629Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7262751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7262867Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7262952Z ) 2025-05-07T20:32:42.7263215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7263311Z def test_silu_mul_quant( 2025-05-07T20:32:42.7263399Z self, 2025-05-07T20:32:42.7263478Z T: int, 2025-05-07T20:32:42.7263556Z D: int, 2025-05-07T20:32:42.7263660Z scale_ub: Optional[float], 2025-05-07T20:32:42.7263751Z contiguous: bool, 2025-05-07T20:32:42.7263836Z compiled: bool, 2025-05-07T20:32:42.7263920Z ) -> None: 2025-05-07T20:32:42.7264018Z torch.manual_seed(2025) 2025-05-07T20:32:42.7264097Z 2025-05-07T20:32:42.7264272Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7266270Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7266279Z 2025-05-07T20:32:42.7266403Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7266408Z 2025-05-07T20:32:42.7266511Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7266751Z self=, 2025-05-07T20:32:42.7266833Z T=16384, 2025-05-07T20:32:42.7266911Z D=7168, 2025-05-07T20:32:42.7267001Z scale_ub=None, 2025-05-07T20:32:42.7267086Z contiguous=True, 2025-05-07T20:32:42.7267175Z compiled=False, 2025-05-07T20:32:42.7267250Z ) 2025-05-07T20:32:42.7267479Z self = 2025-05-07T20:32:42.7267662Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7267669Z 2025-05-07T20:32:42.7267745Z @given( 2025-05-07T20:32:42.7267868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7267973Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7268088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7268208Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7268322Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7268443Z ) 2025-05-07T20:32:42.7268704Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7268837Z def test_silu_mul_quant( 2025-05-07T20:32:42.7268913Z self, 2025-05-07T20:32:42.7268987Z T: int, 2025-05-07T20:32:42.7269068Z D: int, 2025-05-07T20:32:42.7269169Z scale_ub: Optional[float], 2025-05-07T20:32:42.7269262Z contiguous: bool, 2025-05-07T20:32:42.7269392Z compiled: bool, 2025-05-07T20:32:42.7269470Z ) -> None: 2025-05-07T20:32:42.7269575Z torch.manual_seed(2025) 2025-05-07T20:32:42.7269648Z 2025-05-07T20:32:42.7269907Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7271865Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7271873Z 2025-05-07T20:32:42.7271992Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7272001Z 2025-05-07T20:32:42.7272108Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7272339Z self=, 2025-05-07T20:32:42.7272420Z T=16384, 2025-05-07T20:32:42.7272506Z D=7168, 2025-05-07T20:32:42.7272588Z scale_ub=1200.0, 2025-05-07T20:32:42.7272676Z contiguous=True, 2025-05-07T20:32:42.7272762Z compiled=False, 2025-05-07T20:32:42.7272834Z ) 2025-05-07T20:32:42.7273064Z self = 2025-05-07T20:32:42.7273250Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.7273257Z 2025-05-07T20:32:42.7273337Z @given( 2025-05-07T20:32:42.7273459Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7273557Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7273670Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7273790Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7273908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7273988Z ) 2025-05-07T20:32:42.7274294Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7274389Z def test_silu_mul_quant( 2025-05-07T20:32:42.7274472Z self, 2025-05-07T20:32:42.7274551Z T: int, 2025-05-07T20:32:42.7274630Z D: int, 2025-05-07T20:32:42.7274735Z scale_ub: Optional[float], 2025-05-07T20:32:42.7274825Z contiguous: bool, 2025-05-07T20:32:42.7274914Z compiled: bool, 2025-05-07T20:32:42.7274997Z ) -> None: 2025-05-07T20:32:42.7275098Z torch.manual_seed(2025) 2025-05-07T20:32:42.7275174Z 2025-05-07T20:32:42.7275350Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7277304Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7277312Z 2025-05-07T20:32:42.7277434Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7277438Z 2025-05-07T20:32:42.7277597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7277865Z self=, 2025-05-07T20:32:42.7277942Z T=128, 2025-05-07T20:32:42.7278017Z D=5120, 2025-05-07T20:32:42.7278101Z scale_ub=1200.0, 2025-05-07T20:32:42.7278185Z contiguous=False, 2025-05-07T20:32:42.7278269Z compiled=False, 2025-05-07T20:32:42.7278347Z ) 2025-05-07T20:32:42.7278637Z self = 2025-05-07T20:32:42.7278816Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.7278824Z 2025-05-07T20:32:42.7278907Z @given( 2025-05-07T20:32:42.7279027Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7279130Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7279247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7279366Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7279488Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7279569Z ) 2025-05-07T20:32:42.7279829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7279930Z def test_silu_mul_quant( 2025-05-07T20:32:42.7280011Z self, 2025-05-07T20:32:42.7280091Z T: int, 2025-05-07T20:32:42.7280174Z D: int, 2025-05-07T20:32:42.7280278Z scale_ub: Optional[float], 2025-05-07T20:32:42.7280378Z contiguous: bool, 2025-05-07T20:32:42.7280470Z compiled: bool, 2025-05-07T20:32:42.7280555Z ) -> None: 2025-05-07T20:32:42.7280661Z torch.manual_seed(2025) 2025-05-07T20:32:42.7280741Z 2025-05-07T20:32:42.7280918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7281000Z 2025-05-07T20:32:42.7281096Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7281228Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7281325Z x = x_sign * x_clamp 2025-05-07T20:32:42.7281411Z x0 = x[:, :D] 2025-05-07T20:32:42.7281501Z x1 = x[:, D:] 2025-05-07T20:32:42.7281582Z 2025-05-07T20:32:42.7281669Z if contiguous: 2025-05-07T20:32:42.7281765Z x0 = x0.contiguous() 2025-05-07T20:32:42.7281862Z x1 = x1.contiguous() 2025-05-07T20:32:42.7281938Z 2025-05-07T20:32:42.7282035Z if scale_ub is not None: 2025-05-07T20:32:42.7282148Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7282290Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7282374Z ) 2025-05-07T20:32:42.7282503Z else: 2025-05-07T20:32:42.7282601Z scale_ub_tensor = None 2025-05-07T20:32:42.7282678Z 2025-05-07T20:32:42.7282985Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7283114Z op = silu_mul_quant 2025-05-07T20:32:42.7283233Z if compiled: 2025-05-07T20:32:42.7283343Z op = torch.compile(op) 2025-05-07T20:32:42.7283455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7283539Z 2025-05-07T20:32:42.7283635Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7283639Z 2025-05-07T20:32:42.7283742Z moe/activation_test.py:117: 2025-05-07T20:32:42.7283881Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7283987Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7284101Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7284652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.7284757Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7285153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7285392Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7285855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7286004Z kernel = self.compile( 2025-05-07T20:32:42.7286421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7286614Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7286809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7286813Z 2025-05-07T20:32:42.7287034Z self = 2025-05-07T20:32:42.7287891Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7288455Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd34a6940>} 2025-05-07T20:32:42.7293117Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7293328Z context = 2025-05-07T20:32:42.7293338Z 2025-05-07T20:32:42.7293520Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7293806Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7293913Z module_map=module_map) 2025-05-07T20:32:42.7294082Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7294181Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7294262Z E ^ 2025-05-07T20:32:42.7294651Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7294658Z 2025-05-07T20:32:42.7295104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7295109Z 2025-05-07T20:32:42.7295218Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7295450Z self=, 2025-05-07T20:32:42.7295532Z T=2048, 2025-05-07T20:32:42.7295608Z D=7168, 2025-05-07T20:32:42.7295693Z scale_ub=None, 2025-05-07T20:32:42.7295874Z contiguous=False, 2025-05-07T20:32:42.7295958Z compiled=False, 2025-05-07T20:32:42.7296034Z ) 2025-05-07T20:32:42.7296263Z self = 2025-05-07T20:32:42.7296442Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.7296446Z 2025-05-07T20:32:42.7296525Z @given( 2025-05-07T20:32:42.7296645Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7296745Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7296860Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7296978Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7297091Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7297169Z ) 2025-05-07T20:32:42.7297431Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7297523Z def test_silu_mul_quant( 2025-05-07T20:32:42.7297606Z self, 2025-05-07T20:32:42.7297685Z T: int, 2025-05-07T20:32:42.7297766Z D: int, 2025-05-07T20:32:42.7297867Z scale_ub: Optional[float], 2025-05-07T20:32:42.7297953Z contiguous: bool, 2025-05-07T20:32:42.7298036Z compiled: bool, 2025-05-07T20:32:42.7298117Z ) -> None: 2025-05-07T20:32:42.7298270Z torch.manual_seed(2025) 2025-05-07T20:32:42.7298347Z 2025-05-07T20:32:42.7298529Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7300537Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
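Several sampled examples differ only in the contiguous flag, which the test body explains: x0 = x[:, :D] and x1 = x[:, D:] are column slices of a row-major tensor, so they stay non-contiguous views until .contiguous() copies them into packed storage, and the kernel is exercised with both layouts. A standalone illustration of the distinction (not from the test file):

    import torch

    x = torch.randn(4, 8)
    x0 = x[:, :4]  # a view with row stride 8, not packed
    print(x0.is_contiguous())               # False
    print(x0.contiguous().is_contiguous())  # True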
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7300586Z 2025-05-07T20:32:42.7300705Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7300710Z 2025-05-07T20:32:42.7300814Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7301050Z self=, 2025-05-07T20:32:42.7301129Z T=128, 2025-05-07T20:32:42.7301203Z D=7168, 2025-05-07T20:32:42.7301291Z scale_ub=1200.0, 2025-05-07T20:32:42.7301376Z contiguous=True, 2025-05-07T20:32:42.7301459Z compiled=True, 2025-05-07T20:32:42.7301534Z ) 2025-05-07T20:32:42.7301760Z self = 2025-05-07T20:32:42.7301938Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.7301946Z 2025-05-07T20:32:42.7302022Z @given( 2025-05-07T20:32:42.7302140Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7302247Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7302362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7302478Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7302597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7302668Z ) 2025-05-07T20:32:42.7302929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7303030Z def test_silu_mul_quant( 2025-05-07T20:32:42.7303112Z self, 2025-05-07T20:32:42.7303198Z T: int, 2025-05-07T20:32:42.7303279Z D: int, 2025-05-07T20:32:42.7303380Z scale_ub: Optional[float], 2025-05-07T20:32:42.7303473Z contiguous: bool, 2025-05-07T20:32:42.7303560Z compiled: bool, 2025-05-07T20:32:42.7303643Z ) -> None: 2025-05-07T20:32:42.7303742Z torch.manual_seed(2025) 2025-05-07T20:32:42.7303818Z 2025-05-07T20:32:42.7304040Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7304122Z 2025-05-07T20:32:42.7304214Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7304337Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7304430Z x = x_sign * x_clamp 2025-05-07T20:32:42.7304511Z x0 = x[:, :D] 2025-05-07T20:32:42.7304593Z x1 = x[:, D:] 2025-05-07T20:32:42.7304668Z 2025-05-07T20:32:42.7304749Z if contiguous: 2025-05-07T20:32:42.7304846Z x0 = x0.contiguous() 2025-05-07T20:32:42.7304934Z x1 = x1.contiguous() 2025-05-07T20:32:42.7305008Z 2025-05-07T20:32:42.7305100Z if scale_ub is not None: 2025-05-07T20:32:42.7305204Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7305343Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7305420Z ) 2025-05-07T20:32:42.7305497Z else: 2025-05-07T20:32:42.7305594Z scale_ub_tensor = None 2025-05-07T20:32:42.7305669Z 2025-05-07T20:32:42.7305801Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7305893Z op = silu_mul_quant 2025-05-07T20:32:42.7305977Z if compiled: 2025-05-07T20:32:42.7306075Z op = torch.compile(op) 2025-05-07T20:32:42.7306184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7306301Z 2025-05-07T20:32:42.7306397Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7306442Z 2025-05-07T20:32:42.7306546Z moe/activation_test.py:117: 2025-05-07T20:32:42.7306685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7306793Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7306893Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7307294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.7307429Z return fn(*args, **kwargs) 2025-05-07T20:32:42.7307970Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.7308070Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7308458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7308697Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7309067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7309163Z kernel = self.compile( 2025-05-07T20:32:42.7309576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7309883Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7310023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7310027Z 2025-05-07T20:32:42.7310247Z self = 2025-05-07T20:32:42.7311109Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7311668Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3498940>} 2025-05-07T20:32:42.7312493Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7312696Z context = 2025-05-07T20:32:42.7312700Z 2025-05-07T20:32:42.7312876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7313202Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7313312Z module_map=module_map) 2025-05-07T20:32:42.7313482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7313580Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7313659Z E ^ 2025-05-07T20:32:42.7314039Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7314046Z 2025-05-07T20:32:42.7314492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7314496Z 2025-05-07T20:32:42.7314599Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7314836Z self=, 2025-05-07T20:32:42.7314914Z T=128, 2025-05-07T20:32:42.7314998Z D=7168, 2025-05-07T20:32:42.7315084Z scale_ub=1200.0, 2025-05-07T20:32:42.7315173Z contiguous=True, 2025-05-07T20:32:42.7315259Z compiled=False, 2025-05-07T20:32:42.7315334Z ) 2025-05-07T20:32:42.7315565Z self = 2025-05-07T20:32:42.7315787Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.7315792Z 2025-05-07T20:32:42.7315871Z @given( 2025-05-07T20:32:42.7316066Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7316164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7316277Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7316394Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7316507Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7316626Z ) 2025-05-07T20:32:42.7316885Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7316983Z def test_silu_mul_quant( 2025-05-07T20:32:42.7317064Z self, 2025-05-07T20:32:42.7317142Z T: int, 2025-05-07T20:32:42.7317220Z D: int, 2025-05-07T20:32:42.7317323Z scale_ub: Optional[float], 2025-05-07T20:32:42.7317416Z contiguous: bool, 2025-05-07T20:32:42.7317503Z compiled: bool, 2025-05-07T20:32:42.7317590Z ) -> None: 2025-05-07T20:32:42.7317687Z torch.manual_seed(2025) 2025-05-07T20:32:42.7317767Z 2025-05-07T20:32:42.7317949Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7318029Z 2025-05-07T20:32:42.7318130Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7318257Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7320220Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7320233Z 2025-05-07T20:32:42.7320349Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.7320355Z 2025-05-07T20:32:42.7320460Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7320692Z self=, 2025-05-07T20:32:42.7320773Z T=128, 2025-05-07T20:32:42.7320854Z D=5120, 2025-05-07T20:32:42.7320942Z scale_ub=1200.0, 2025-05-07T20:32:42.7321029Z contiguous=True, 2025-05-07T20:32:42.7321116Z compiled=True, 2025-05-07T20:32:42.7321196Z ) 2025-05-07T20:32:42.7321422Z self = 2025-05-07T20:32:42.7321644Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.7321649Z 2025-05-07T20:32:42.7321726Z @given( 2025-05-07T20:32:42.7321843Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7321944Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7322059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7322173Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7322292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7322365Z ) 2025-05-07T20:32:42.7322624Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7322727Z def test_silu_mul_quant( 2025-05-07T20:32:42.7322807Z self, 2025-05-07T20:32:42.7322890Z T: int, 2025-05-07T20:32:42.7322969Z D: int, 2025-05-07T20:32:42.7323069Z scale_ub: Optional[float], 2025-05-07T20:32:42.7323161Z contiguous: bool, 2025-05-07T20:32:42.7323249Z compiled: bool, 2025-05-07T20:32:42.7323330Z ) -> None: 2025-05-07T20:32:42.7323428Z torch.manual_seed(2025) 2025-05-07T20:32:42.7323502Z 2025-05-07T20:32:42.7323677Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7323761Z 2025-05-07T20:32:42.7323855Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7324026Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7326011Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7326054Z 2025-05-07T20:32:42.7326171Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.7326179Z 2025-05-07T20:32:42.7326279Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7326508Z self=, 2025-05-07T20:32:42.7326593Z T=128, 2025-05-07T20:32:42.7326675Z D=7168, 2025-05-07T20:32:42.7326764Z scale_ub=None, 2025-05-07T20:32:42.7326857Z contiguous=True, 2025-05-07T20:32:42.7326943Z compiled=True, 2025-05-07T20:32:42.7327021Z ) 2025-05-07T20:32:42.7327256Z self = 2025-05-07T20:32:42.7327435Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.7327443Z 2025-05-07T20:32:42.7327527Z @given( 2025-05-07T20:32:42.7327649Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7327749Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7327871Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7327989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7328105Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7328183Z ) 2025-05-07T20:32:42.7328446Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7328539Z def test_silu_mul_quant( 2025-05-07T20:32:42.7328622Z self, 2025-05-07T20:32:42.7328700Z T: int, 2025-05-07T20:32:42.7328776Z D: int, 2025-05-07T20:32:42.7328879Z scale_ub: Optional[float], 2025-05-07T20:32:42.7328970Z contiguous: bool, 2025-05-07T20:32:42.7329058Z compiled: bool, 2025-05-07T20:32:42.7329139Z ) -> None: 2025-05-07T20:32:42.7329236Z torch.manual_seed(2025) 2025-05-07T20:32:42.7329317Z 2025-05-07T20:32:42.7329491Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7331489Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7331501Z 2025-05-07T20:32:42.7331620Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7331758Z =============================== warnings summary =============================== 2025-05-07T20:32:42.7332088Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:42.7332409Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:42.7332727Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:42.7333728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:42.7334005Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:42.7334009Z 2025-05-07T20:32:42.7334236Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:42.7334414Z ================= 1 failed, 1 deselected, 3 warnings in 24.06s ================= 2025-05-07T20:32:44.3671760Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:44.4297609Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:44.4297943Z 2025-05-07T20:32:46.4317459Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:48.5854384Z ============================= test session starts ============================== 2025-05-07T20:32:48.5855071Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:48.5855625Z cachedir: .pytest_cache 2025-05-07T20:32:48.5856237Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:48.5857009Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:48.5857437Z plugins: hypothesis-6.131.14 2025-05-07T20:32:50.2045264Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:50.4172849Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:50.4173284Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:50.4173513Z 2025-05-07T20:32:53.1083531Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.1084287Z self=, 2025-05-07T20:32:53.1084748Z T=1, 2025-05-07T20:32:53.1084947Z D=5120, 2025-05-07T20:32:53.1085150Z scale_ub=None, 2025-05-07T20:32:53.1085378Z contiguous=True, 2025-05-07T20:32:53.1085625Z compiled=True, 2025-05-07T20:32:53.1085842Z ) 2025-05-07T20:32:53.1086181Z self = 2025-05-07T20:32:53.1086704Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:53.1086992Z 2025-05-07T20:32:53.1087074Z @given( 2025-05-07T20:32:53.1087597Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.1087934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.1088249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.1088598Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.1088950Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.1089273Z ) 2025-05-07T20:32:53.1089671Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.1090154Z def test_silu_mul_quant( 2025-05-07T20:32:53.1090404Z self, 2025-05-07T20:32:53.1090607Z T: int, 2025-05-07T20:32:53.1090816Z D: int, 2025-05-07T20:32:53.1091037Z scale_ub: Optional[float], 2025-05-07T20:32:53.1091326Z contiguous: bool, 2025-05-07T20:32:53.1091578Z compiled: bool, 2025-05-07T20:32:53.1091821Z ) -> None: 2025-05-07T20:32:53.1092046Z torch.manual_seed(2025) 2025-05-07T20:32:53.1092311Z 2025-05-07T20:32:53.1092600Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.1092966Z 2025-05-07T20:32:53.1093173Z x_sign = torch.sign(x) 2025-05-07T20:32:53.1093481Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:53.1093896Z x = x_sign * x_clamp 2025-05-07T20:32:53.1094149Z x0 = x[:, :D] 2025-05-07T20:32:53.1094444Z x1 = x[:, D:] 2025-05-07T20:32:53.1094655Z 2025-05-07T20:32:53.1094848Z if contiguous: 2025-05-07T20:32:53.1095086Z x0 = x0.contiguous() 2025-05-07T20:32:53.1095348Z x1 = x1.contiguous() 2025-05-07T20:32:53.1095602Z 2025-05-07T20:32:53.1095800Z if scale_ub is not None: 2025-05-07T20:32:53.1096165Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.1096515Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.1096844Z ) 2025-05-07T20:32:53.1097037Z else: 2025-05-07T20:32:53.1097258Z scale_ub_tensor = None 2025-05-07T20:32:53.1097519Z 2025-05-07T20:32:53.1097756Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.1098079Z op = silu_mul_quant 2025-05-07T20:32:53.1098344Z if compiled: 2025-05-07T20:32:53.1098604Z op = torch.compile(op) 2025-05-07T20:32:53.1098911Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1099256Z 2025-05-07T20:32:53.1099461Z y_fp8, y_scale = fn() 2025-05-07T20:32:53.1099752Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:53.1100062Z 2025-05-07T20:32:53.1100308Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.1100655Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:53.1100970Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:53.1101307Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:53.1101684Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.1102005Z 2025-05-07T20:32:53.1102208Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:53.1102410Z 2025-05-07T20:32:53.1102516Z moe/activation_test.py:126: 2025-05-07T20:32:53.1102816Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1103172Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:53.1103514Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.1104366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:53.1105190Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:53.1105773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.1106510Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.1107302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:53.1108082Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.1108892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:53.1109693Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.1110646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:53.1111332Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:53.1111974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:53.1112525Z fn() 2025-05-07T20:32:53.1113066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:53.1113691Z self.fn.run( 
2025-05-07T20:32:53.1114180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.1114742Z kernel = self.compile( 2025-05-07T20:32:53.1115367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.1116112Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.1116522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1116774Z 2025-05-07T20:32:53.1116988Z self = 2025-05-07T20:32:53.1118178Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.1119757Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bbfdc9d0>} 2025-05-07T20:32:53.1121282Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.1122389Z context = 2025-05-07T20:32:53.1122700Z 2025-05-07T20:32:53.1122870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.1123418Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.1123918Z module_map=module_map) 2025-05-07T20:32:53.1124293Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.1124667Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:53.1124944Z E ^ 2025-05-07T20:32:53.1125439Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.1125940Z 2025-05-07T20:32:53.1126391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.1126957Z 2025-05-07T20:32:53.1127063Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.1127499Z self=, 2025-05-07T20:32:53.1127920Z T=2048, 2025-05-07T20:32:53.1128115Z D=5120, 2025-05-07T20:32:53.1128311Z scale_ub=1200.0, 2025-05-07T20:32:53.1128531Z contiguous=True, 2025-05-07T20:32:53.1128767Z compiled=False, 2025-05-07T20:32:53.1128978Z ) 2025-05-07T20:32:54.6256795Z self = 2025-05-07T20:32:54.6257653Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.6257960Z 2025-05-07T20:32:54.6258041Z @given( 2025-05-07T20:32:54.6258280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.6258599Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.6258943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.6259295Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.6259638Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.6259929Z ) 2025-05-07T20:32:54.6260289Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.6260752Z def test_silu_mul_quant( 2025-05-07T20:32:54.6260998Z self, 2025-05-07T20:32:54.6261190Z T: int, 2025-05-07T20:32:54.6261384Z D: int, 2025-05-07T20:32:54.6261602Z scale_ub: Optional[float], 2025-05-07T20:32:54.6261878Z contiguous: bool, 2025-05-07T20:32:54.6262113Z compiled: bool, 2025-05-07T20:32:54.6262345Z ) -> None: 2025-05-07T20:32:54.6262564Z torch.manual_seed(2025) 2025-05-07T20:32:54.6262801Z 2025-05-07T20:32:54.6263076Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.6263437Z 
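# --- Aside on the CompilationError above: Triton's fp8e4nv is the NVIDIA
# FP8 E4M3 type, which Triton only lowers for compute capability >= 8.9
# (Ada/Hopper). A linux.g5.4xlarge runner carries an NVIDIA A10G (SM 8.6),
# so every FP8 quantization path in this job fails at kernel-compile time
# and, per the error text, only fp8e4b15 and fp8e5 remain available. A
# guard like the sketch below (illustrative names, not an existing helper
# in this test) would skip instead of failing:
import unittest

import torch

def gpu_supports_fp8_e4m3() -> bool:
    # FP8 E4M3 ("fp8e4nv") needs SM 8.9+; query the active device.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# @unittest.skipIf(not gpu_supports_fp8_e4m3(), "FP8 E4M3 requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...
# --- End aside; the captured test source resumes below.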
2025-05-07T20:32:54.6263626Z x_sign = torch.sign(x) 2025-05-07T20:32:54.6264006Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.6264400Z x = x_sign * x_clamp 2025-05-07T20:32:54.6264650Z x0 = x[:, :D] 2025-05-07T20:32:54.6264861Z x1 = x[:, D:] 2025-05-07T20:32:54.6265069Z 2025-05-07T20:32:54.6265257Z if contiguous: 2025-05-07T20:32:54.6265487Z x0 = x0.contiguous() 2025-05-07T20:32:54.6265753Z x1 = x1.contiguous() 2025-05-07T20:32:54.6266089Z 2025-05-07T20:32:54.6266281Z if scale_ub is not None: 2025-05-07T20:32:54.6266563Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.6266910Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.6267222Z ) 2025-05-07T20:32:54.6267413Z else: 2025-05-07T20:32:54.6267627Z scale_ub_tensor = None 2025-05-07T20:32:54.6267874Z 2025-05-07T20:32:54.6268104Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.6268432Z op = silu_mul_quant 2025-05-07T20:32:54.6268679Z if compiled: 2025-05-07T20:32:54.6268937Z op = torch.compile(op) 2025-05-07T20:32:54.6269239Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.6269519Z 2025-05-07T20:32:54.6269703Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.6269995Z 2025-05-07T20:32:54.6270095Z moe/activation_test.py:117: 2025-05-07T20:32:54.6270401Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.6270747Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.6271037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.6271786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.6272529Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.6273098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.6273835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.6274548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.6275114Z kernel = self.compile( 2025-05-07T20:32:54.6275687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.6276393Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.6276809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.6277049Z 2025-05-07T20:32:54.6277314Z self = 2025-05-07T20:32:54.6278498Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.6280029Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1bb06cdc0>} 2025-05-07T20:32:54.6281510Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.6282621Z context = 2025-05-07T20:32:54.6283097Z 2025-05-07T20:32:54.6283268Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.6283824Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.6284316Z module_map=module_map) 2025-05-07T20:32:54.6284689Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.6285130Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.6285402Z E ^ 2025-05-07T20:32:54.6285970Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.6286465Z 2025-05-07T20:32:54.6286914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.6287477Z 2025-05-07T20:32:54.6287639Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.6288069Z self=, 2025-05-07T20:32:54.6288483Z T=2048, 2025-05-07T20:32:54.6288677Z D=5120, 2025-05-07T20:32:54.6288877Z scale_ub=1200.0, 2025-05-07T20:32:54.6289097Z contiguous=True, 2025-05-07T20:32:54.6289319Z compiled=True, 2025-05-07T20:32:54.6289550Z ) 2025-05-07T20:32:54.6289894Z self = 2025-05-07T20:32:54.6290415Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.6290709Z 2025-05-07T20:32:54.6290789Z @given( 2025-05-07T20:32:54.6291033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.6291354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.6291674Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.6300115Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.6300531Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.6300837Z ) 2025-05-07T20:32:54.6301207Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.6301682Z def test_silu_mul_quant( 2025-05-07T20:32:54.6301932Z self, 2025-05-07T20:32:54.6302133Z T: int, 2025-05-07T20:32:54.6302331Z D: int, 2025-05-07T20:32:54.6302558Z scale_ub: Optional[float], 2025-05-07T20:32:54.6302844Z contiguous: bool, 2025-05-07T20:32:54.6303085Z compiled: bool, 2025-05-07T20:32:54.6303320Z ) -> None: 2025-05-07T20:32:54.6303541Z torch.manual_seed(2025) 2025-05-07T20:32:54.6303789Z 2025-05-07T20:32:54.6304070Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.6304434Z 2025-05-07T20:32:54.6304622Z x_sign = torch.sign(x) 2025-05-07T20:32:54.6304926Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.6305252Z x = x_sign * x_clamp 2025-05-07T20:32:54.6305502Z x0 = x[:, :D] 2025-05-07T20:32:54.6305726Z x1 = x[:, D:] 2025-05-07T20:32:54.6305942Z 2025-05-07T20:32:54.6306122Z if contiguous: 2025-05-07T20:32:54.6306469Z x0 = x0.contiguous() 2025-05-07T20:32:54.6306742Z x1 = x1.contiguous() 2025-05-07T20:32:54.6306993Z 2025-05-07T20:32:54.6307182Z if scale_ub is not None: 2025-05-07T20:32:54.6307465Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.6307815Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.6308132Z ) 2025-05-07T20:32:54.6308332Z else: 2025-05-07T20:32:54.6308545Z scale_ub_tensor = None 2025-05-07T20:32:54.6308795Z 2025-05-07T20:32:54.6309036Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.6309365Z op = silu_mul_quant 2025-05-07T20:32:54.6309626Z if compiled: 
2025-05-07T20:32:54.6309982Z op = torch.compile(op) 2025-05-07T20:32:54.6310287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.6310579Z 2025-05-07T20:32:54.6310770Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.6311060Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.6311362Z 2025-05-07T20:32:54.6311602Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.6311952Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.6312250Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.6312625Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.6313040Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.6313357Z 2025-05-07T20:32:54.6313560Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.6313763Z 2025-05-07T20:32:54.6313876Z moe/activation_test.py:126: 2025-05-07T20:32:54.6314176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.6314569Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.6314904Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.6315757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.6316562Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.6317141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.6317880Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.6318612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.6319388Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.6320222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:54.6321054Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.6321834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.6322520Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.6323163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.6323719Z fn() 2025-05-07T20:32:54.6324255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.6324883Z self.fn.run( 2025-05-07T20:32:54.6325374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.6325936Z kernel = self.compile( 2025-05-07T20:32:54.6326510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.6327211Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.6327672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.6327915Z 2025-05-07T20:32:54.6328128Z self = 2025-05-07T20:32:54.6329305Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:54.6330819Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1baa53550>} 2025-05-07T20:32:54.6332291Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.6333407Z context = 2025-05-07T20:32:54.6333715Z 2025-05-07T20:32:54.6333885Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.6334433Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.6334924Z module_map=module_map) 2025-05-07T20:32:54.6335336Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.6335738Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.6336013Z E ^ 2025-05-07T20:32:54.6336505Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.6336993Z 2025-05-07T20:32:54.6337439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.6338038Z 2025-05-07T20:32:54.6338139Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.6338567Z self=, 2025-05-07T20:32:54.6338983Z T=16384, 2025-05-07T20:32:54.6339179Z D=7168, 2025-05-07T20:32:54.6339371Z scale_ub=1200.0, 2025-05-07T20:32:54.6339594Z contiguous=False, 2025-05-07T20:32:54.6339817Z compiled=False, 2025-05-07T20:32:54.6340023Z ) 2025-05-07T20:32:55.9595542Z self = 2025-05-07T20:32:55.9596199Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.9596703Z 2025-05-07T20:32:55.9596796Z @given( 2025-05-07T20:32:55.9597049Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.9597389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.9597710Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.9598070Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.9598423Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.9598719Z ) 2025-05-07T20:32:55.9599106Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.9599598Z def test_silu_mul_quant( 2025-05-07T20:32:55.9600022Z self, 2025-05-07T20:32:55.9600218Z T: int, 2025-05-07T20:32:55.9600417Z D: int, 2025-05-07T20:32:55.9600633Z scale_ub: Optional[float], 2025-05-07T20:32:55.9600921Z contiguous: bool, 2025-05-07T20:32:55.9601174Z compiled: bool, 2025-05-07T20:32:55.9601403Z ) -> None: 2025-05-07T20:32:55.9601622Z torch.manual_seed(2025) 2025-05-07T20:32:55.9601871Z 2025-05-07T20:32:55.9602155Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.9602516Z 2025-05-07T20:32:55.9602712Z x_sign = torch.sign(x) 2025-05-07T20:32:55.9603015Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.9603336Z x = x_sign * x_clamp 2025-05-07T20:32:55.9603584Z x0 = x[:, :D] 2025-05-07T20:32:55.9603804Z x1 = x[:, D:] 2025-05-07T20:32:55.9604143Z 2025-05-07T20:32:55.9604335Z if contiguous: 2025-05-07T20:32:55.9604573Z x0 = x0.contiguous() 2025-05-07T20:32:55.9604834Z x1 = x1.contiguous() 2025-05-07T20:32:55.9605083Z 2025-05-07T20:32:55.9605282Z if scale_ub is not None: 2025-05-07T20:32:55.9605559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.9605915Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.9606245Z ) 2025-05-07T20:32:55.9606435Z else: 2025-05-07T20:32:55.9606645Z scale_ub_tensor = None 2025-05-07T20:32:55.9606904Z 2025-05-07T20:32:55.9607138Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:32:55.9607458Z op = silu_mul_quant 2025-05-07T20:32:55.9607718Z if compiled: 2025-05-07T20:32:55.9607971Z op = torch.compile(op) 2025-05-07T20:32:55.9608272Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.9608566Z 2025-05-07T20:32:55.9608759Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.9608927Z 2025-05-07T20:32:55.9609025Z moe/activation_test.py:117: 2025-05-07T20:32:55.9609332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9609682Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.9610034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.9610831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.9611576Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.9612141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.9612931Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.9613642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.9614212Z kernel = self.compile( 2025-05-07T20:32:55.9614788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.9615482Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.9615899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9616144Z 2025-05-07T20:32:55.9616362Z self = 2025-05-07T20:32:55.9617528Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.9619042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1baa533a0>} 2025-05-07T20:32:55.9620572Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.9621683Z context = 2025-05-07T20:32:55.9621992Z 2025-05-07T20:32:55.9622167Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.9622711Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.9623200Z module_map=module_map) 2025-05-07T20:32:55.9623577Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.9623937Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.9624205Z E ^ 2025-05-07T20:32:55.9624697Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.9625236Z 2025-05-07T20:32:55.9625690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.9626245Z 2025-05-07T20:32:55.9626350Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.9626781Z self=, 2025-05-07T20:32:55.9627202Z T=1, 2025-05-07T20:32:55.9627382Z D=7168, 2025-05-07T20:32:55.9627580Z scale_ub=None, 2025-05-07T20:32:55.9627792Z contiguous=True, 2025-05-07T20:32:55.9628016Z compiled=True, 2025-05-07T20:32:55.9628214Z ) 2025-05-07T20:32:55.9628538Z self = 2025-05-07T20:32:55.9629047Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:55.9629326Z 2025-05-07T20:32:55.9629403Z @given( 2025-05-07T20:32:55.9629633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.9630049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.9630366Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.9630710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.9631044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.9631341Z ) 2025-05-07T20:32:55.9631775Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.9632279Z def test_silu_mul_quant( 2025-05-07T20:32:55.9632514Z self, 2025-05-07T20:32:55.9632710Z T: int, 2025-05-07T20:32:55.9632911Z D: int, 2025-05-07T20:32:55.9633122Z scale_ub: Optional[float], 2025-05-07T20:32:55.9633400Z contiguous: bool, 2025-05-07T20:32:55.9633641Z compiled: bool, 2025-05-07T20:32:55.9633898Z ) -> None: 2025-05-07T20:32:55.9634112Z torch.manual_seed(2025) 2025-05-07T20:32:55.9634354Z 2025-05-07T20:32:55.9634622Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.9634980Z 2025-05-07T20:32:55.9635172Z x_sign = torch.sign(x) 2025-05-07T20:32:55.9635461Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.9635777Z x = x_sign * x_clamp 2025-05-07T20:32:55.9636019Z x0 = x[:, :D] 2025-05-07T20:32:55.9636231Z x1 = x[:, D:] 2025-05-07T20:32:55.9636446Z 2025-05-07T20:32:55.9636629Z if contiguous: 2025-05-07T20:32:55.9636856Z x0 = x0.contiguous() 2025-05-07T20:32:55.9637118Z x1 = x1.contiguous() 2025-05-07T20:32:55.9637359Z 2025-05-07T20:32:55.9637546Z if scale_ub is not None: 2025-05-07T20:32:55.9637818Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.9638159Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.9638482Z ) 2025-05-07T20:32:55.9638665Z else: 2025-05-07T20:32:55.9638870Z scale_ub_tensor = None 2025-05-07T20:32:55.9639128Z 2025-05-07T20:32:55.9639354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.9639675Z op = silu_mul_quant 2025-05-07T20:32:55.9639933Z if compiled: 2025-05-07T20:32:55.9640174Z op = torch.compile(op) 2025-05-07T20:32:55.9640476Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.9640765Z 2025-05-07T20:32:55.9640953Z y_fp8, y_scale = fn() 2025-05-07T20:32:55.9641248Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:55.9641549Z 2025-05-07T20:32:55.9641786Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.9642134Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:55.9642435Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:55.9642757Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:55.9643125Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.9643453Z 2025-05-07T20:32:55.9643707Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:55.9643910Z 2025-05-07T20:32:55.9644009Z moe/activation_test.py:126: 2025-05-07T20:32:55.9644316Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9644664Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.9644998Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.9645853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.9646668Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.9647248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.9647974Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.9648717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.9649492Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.9650350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:55.9651190Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.9651973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.9652690Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.9653328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.9653873Z fn() 2025-05-07T20:32:55.9654454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.9655072Z self.fn.run( 2025-05-07T20:32:55.9655555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.9656118Z kernel = self.compile( 2025-05-07T20:32:55.9656690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.9657390Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.9657798Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9658050Z 2025-05-07T20:32:55.9658263Z self = 2025-05-07T20:32:55.9659432Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.9660950Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1ba9ff9d0>} 2025-05-07T20:32:55.9662421Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.9663539Z context = 2025-05-07T20:32:55.9663860Z 2025-05-07T20:32:55.9664030Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.9664579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.9665064Z module_map=module_map) 2025-05-07T20:32:55.9665444Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.9665812Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.9666077Z E ^ 2025-05-07T20:32:55.9666616Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.9667117Z 2025-05-07T20:32:55.9667565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.9668120Z 2025-05-07T20:32:55.9668234Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.9668656Z self=, 2025-05-07T20:32:55.9669079Z T=4096, 2025-05-07T20:32:55.9669264Z D=5120, 2025-05-07T20:32:55.9669452Z scale_ub=None, 2025-05-07T20:32:55.9669668Z contiguous=False, 2025-05-07T20:32:55.9669995Z compiled=False, 2025-05-07T20:32:55.9670200Z ) 2025-05-07T20:32:57.7180177Z self = 2025-05-07T20:32:57.7180763Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:57.7181083Z 2025-05-07T20:32:57.7181190Z @given( 2025-05-07T20:32:57.7181436Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.7181824Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.7182285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.7182934Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.7183532Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.7183846Z ) 2025-05-07T20:32:57.7184272Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.7184743Z def test_silu_mul_quant( 2025-05-07T20:32:57.7184996Z self, 2025-05-07T20:32:57.7185198Z T: int, 2025-05-07T20:32:57.7185394Z D: int, 2025-05-07T20:32:57.7185621Z scale_ub: Optional[float], 2025-05-07T20:32:57.7185966Z contiguous: bool, 2025-05-07T20:32:57.7186208Z compiled: bool, 2025-05-07T20:32:57.7186431Z ) -> None: 2025-05-07T20:32:57.7186645Z torch.manual_seed(2025) 2025-05-07T20:32:57.7186886Z 2025-05-07T20:32:57.7187168Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.7187528Z 2025-05-07T20:32:57.7187716Z x_sign = torch.sign(x) 2025-05-07T20:32:57.7188011Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.7188335Z x = x_sign * x_clamp 2025-05-07T20:32:57.7188620Z x0 = x[:, :D] 2025-05-07T20:32:57.7188840Z x1 = x[:, D:] 2025-05-07T20:32:57.7189046Z 2025-05-07T20:32:57.7189237Z if contiguous: 2025-05-07T20:32:57.7189471Z x0 = x0.contiguous() 2025-05-07T20:32:57.7189729Z x1 = x1.contiguous() 2025-05-07T20:32:57.7190063Z 2025-05-07T20:32:57.7190247Z if scale_ub is not None: 2025-05-07T20:32:57.7190524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.7190869Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.7191184Z ) 2025-05-07T20:32:57.7191380Z else: 2025-05-07T20:32:57.7191589Z scale_ub_tensor = None 2025-05-07T20:32:57.7191844Z 2025-05-07T20:32:57.7192075Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.7192393Z op = silu_mul_quant 2025-05-07T20:32:57.7192649Z if compiled: 
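# --- Aside: for readers skimming the repeated listings, the op under test
# computes SiLU(x0) * x1 followed by row-wise FP8 quantization. Below is a
# rough pure-float32 sketch of the reference path; the 448.0 bound is the
# FP8 E4M3 maximum, and the scaling scheme is an assumption consistent with
# the dequant step `y_fp8.to(torch.float32) * y_scale[:, None]`, not the
# exact fbgemm kernel.
import torch

FP8_E4M3_MAX = 448.0

def silu_mul_quant_ref(x0, x1, scale_ub=None):
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap, as scale_ub_tensor does above
    y_scale = row_max / FP8_E4M3_MAX                # one scale per row
    y_q = (y / y_scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    return y_q, y_scale.squeeze(-1)  # casting y_q to fp8e4nv itself needs SM 8.9+
# --- End aside; the captured test source resumes below.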
2025-05-07T20:32:57.7192898Z op = torch.compile(op) 2025-05-07T20:32:57.7193196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.7193483Z 2025-05-07T20:32:57.7193674Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.7193845Z 2025-05-07T20:32:57.7193944Z moe/activation_test.py:117: 2025-05-07T20:32:57.7194244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.7194591Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.7194879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.7195618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.7196435Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.7197008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.7197732Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.7198444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.7199016Z kernel = self.compile( 2025-05-07T20:32:57.7199595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.7200290Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.7200703Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.7200950Z 2025-05-07T20:32:57.7201171Z self = 2025-05-07T20:32:57.7202350Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.7203905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1ba86fe50>} 2025-05-07T20:32:57.7205435Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.7206533Z context = 2025-05-07T20:32:57.7206876Z 2025-05-07T20:32:57.7207054Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.7207598Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.7208093Z module_map=module_map) 2025-05-07T20:32:57.7208469Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.7208832Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.7209090Z E ^ 2025-05-07T20:32:57.7209581Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.7210072Z 2025-05-07T20:32:57.7210525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.7211076Z 2025-05-07T20:32:57.7211183Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7211606Z self=, 2025-05-07T20:32:57.7212032Z T=4096, 2025-05-07T20:32:57.7212223Z D=7168, 2025-05-07T20:32:57.7212408Z scale_ub=None, 2025-05-07T20:32:57.7212624Z contiguous=False, 2025-05-07T20:32:57.7212854Z compiled=False, 2025-05-07T20:32:57.7213051Z ) 2025-05-07T20:32:57.7213376Z self = 2025-05-07T20:32:57.7213896Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:57.7214186Z 2025-05-07T20:32:57.7214260Z @given( 2025-05-07T20:32:57.7214495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.7214819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.7215136Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.7215479Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.7215818Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.7216106Z ) 2025-05-07T20:32:57.7216469Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.7216943Z def test_silu_mul_quant( 2025-05-07T20:32:57.7223198Z self, 2025-05-07T20:32:57.7223436Z T: int, 2025-05-07T20:32:57.7223712Z D: int, 2025-05-07T20:32:57.7223936Z scale_ub: Optional[float], 2025-05-07T20:32:57.7224214Z contiguous: bool, 2025-05-07T20:32:57.7224453Z compiled: bool, 2025-05-07T20:32:57.7224688Z ) -> None: 2025-05-07T20:32:57.7224918Z torch.manual_seed(2025) 2025-05-07T20:32:57.7225161Z 2025-05-07T20:32:57.7225449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.7225820Z 2025-05-07T20:32:57.7226014Z x_sign = torch.sign(x) 2025-05-07T20:32:57.7226316Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.7226639Z x = x_sign * x_clamp 2025-05-07T20:32:57.7226885Z x0 = x[:, :D] 2025-05-07T20:32:57.7227106Z x1 = x[:, D:] 2025-05-07T20:32:57.7227322Z 2025-05-07T20:32:57.7227513Z if contiguous: 2025-05-07T20:32:57.7227746Z x0 = x0.contiguous() 2025-05-07T20:32:57.7228012Z x1 = x1.contiguous() 2025-05-07T20:32:57.7228260Z 2025-05-07T20:32:57.7228452Z if scale_ub is not None: 2025-05-07T20:32:57.7228732Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.7229085Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.7229400Z ) 2025-05-07T20:32:57.7229597Z else: 2025-05-07T20:32:57.7229985Z scale_ub_tensor = None 2025-05-07T20:32:57.7230246Z 2025-05-07T20:32:57.7230531Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.7230866Z op = silu_mul_quant 2025-05-07T20:32:57.7231118Z if compiled: 2025-05-07T20:32:57.7231378Z op = torch.compile(op) 2025-05-07T20:32:57.7231692Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.7231977Z 2025-05-07T20:32:57.7232216Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.7232391Z 2025-05-07T20:32:57.7232493Z moe/activation_test.py:117: 2025-05-07T20:32:57.7232803Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.7233147Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.7233440Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.7234184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.7234931Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.7235493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.7236230Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.7236937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.7237512Z kernel = self.compile( 2025-05-07T20:32:57.7238078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.7238779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.7239197Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.7239440Z 2025-05-07T20:32:57.7239660Z self = 2025-05-07T20:32:57.7240833Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.7242347Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1ba5a8a60>} 2025-05-07T20:32:57.7243820Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.7244979Z context = 2025-05-07T20:32:57.7245288Z 2025-05-07T20:32:57.7245456Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.7246005Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.7246501Z module_map=module_map) 2025-05-07T20:32:57.7246880Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.7247239Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.7247510Z E ^ 2025-05-07T20:32:57.7248006Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.7248500Z 2025-05-07T20:32:57.7248950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.7249513Z 2025-05-07T20:32:57.7249617Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7250050Z self=, 2025-05-07T20:32:57.7250472Z T=128, 2025-05-07T20:32:57.7250654Z D=7168, 2025-05-07T20:32:57.7250844Z scale_ub=None, 2025-05-07T20:32:57.7251065Z contiguous=False, 2025-05-07T20:32:57.7251331Z compiled=True, 2025-05-07T20:32:57.7251537Z ) 2025-05-07T20:32:57.7997398Z self = 2025-05-07T20:32:57.7998038Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:57.7998336Z 2025-05-07T20:32:57.7998411Z @given( 2025-05-07T20:32:57.7998649Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.7999110Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.7999552Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.7999903Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.8000254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.8000554Z ) 2025-05-07T20:32:57.8000921Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.8001397Z def test_silu_mul_quant( 2025-05-07T20:32:57.8001645Z self, 2025-05-07T20:32:57.8001846Z T: int, 2025-05-07T20:32:57.8002055Z D: int, 2025-05-07T20:32:57.8002276Z scale_ub: Optional[float], 2025-05-07T20:32:57.8002561Z contiguous: bool, 2025-05-07T20:32:57.8002809Z compiled: bool, 2025-05-07T20:32:57.8003045Z ) -> None: 2025-05-07T20:32:57.8003260Z torch.manual_seed(2025) 2025-05-07T20:32:57.8003514Z 2025-05-07T20:32:57.8003798Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.8004158Z 2025-05-07T20:32:57.8004356Z x_sign = torch.sign(x) 2025-05-07T20:32:57.8004657Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.8004977Z x = x_sign * x_clamp 2025-05-07T20:32:57.8005229Z x0 = x[:, :D] 2025-05-07T20:32:57.8005452Z x1 = x[:, D:] 2025-05-07T20:32:57.8005662Z 2025-05-07T20:32:57.8005854Z if contiguous: 2025-05-07T20:32:57.8006092Z x0 = x0.contiguous() 2025-05-07T20:32:57.8006355Z x1 = x1.contiguous() 2025-05-07T20:32:57.8006608Z 2025-05-07T20:32:57.8006808Z if scale_ub is not None: 2025-05-07T20:32:57.8007086Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.8007442Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.8007771Z ) 2025-05-07T20:32:57.8007978Z else: 2025-05-07T20:32:57.8008190Z scale_ub_tensor = None 2025-05-07T20:32:57.8008454Z 2025-05-07T20:32:57.8008692Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.8009023Z op = silu_mul_quant 2025-05-07T20:32:57.8009280Z if compiled: 2025-05-07T20:32:57.8009530Z op = torch.compile(op) 2025-05-07T20:32:57.8009903Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.8010186Z 2025-05-07T20:32:57.8010378Z y_fp8, y_scale = fn() 2025-05-07T20:32:57.8010663Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:57.8010959Z 2025-05-07T20:32:57.8011200Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.8011539Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:57.8011842Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:57.8012169Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:57.8012541Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.8012856Z 2025-05-07T20:32:57.8013055Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:57.8013261Z 2025-05-07T20:32:57.8013367Z moe/activation_test.py:126: 2025-05-07T20:32:57.8013668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.8014020Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:57.8014353Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.8015201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:57.8016088Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:57.8016673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.8017449Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.8018187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:57.8019005Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:57.8019820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:57.8020628Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:57.8021413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:57.8022111Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:57.8022761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:57.8023326Z fn() 2025-05-07T20:32:57.8023863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:57.8024488Z self.fn.run( 2025-05-07T20:32:57.8024978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.8025546Z kernel = self.compile( 2025-05-07T20:32:57.8026124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.8026827Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.8027242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.8027493Z 2025-05-07T20:32:57.8027709Z self = 2025-05-07T20:32:57.8028894Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.8030542Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1ba5e2550>} 2025-05-07T20:32:57.8032092Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.8033202Z context = 2025-05-07T20:32:57.8033513Z 2025-05-07T20:32:57.8033682Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.8034235Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.8034732Z module_map=module_map) 2025-05-07T20:32:57.8035104Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.8035468Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:57.8035741Z E ^ 2025-05-07T20:32:57.8036230Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.8036732Z 2025-05-07T20:32:57.8037184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.8037749Z 2025-05-07T20:32:57.8037854Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.8038280Z self=, 2025-05-07T20:32:57.8038700Z T=128, 2025-05-07T20:32:57.8038886Z D=7168, 2025-05-07T20:32:57.8039146Z scale_ub=None, 2025-05-07T20:32:57.8039359Z contiguous=False, 2025-05-07T20:32:57.8039625Z compiled=False, 2025-05-07T20:32:57.8039834Z ) 2025-05-07T20:32:58.2018815Z self = 2025-05-07T20:32:58.2020147Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.2020757Z 2025-05-07T20:32:58.2020841Z @given( 2025-05-07T20:32:58.2021215Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.2021541Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.2021861Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.2022212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.2022545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.2022842Z ) 2025-05-07T20:32:58.2023215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.2023681Z def test_silu_mul_quant( 2025-05-07T20:32:58.2023929Z self, 2025-05-07T20:32:58.2024125Z T: int, 2025-05-07T20:32:58.2024320Z D: int, 2025-05-07T20:32:58.2024545Z scale_ub: Optional[float], 2025-05-07T20:32:58.2024828Z contiguous: bool, 2025-05-07T20:32:58.2025070Z compiled: bool, 2025-05-07T20:32:58.2025289Z ) -> None: 2025-05-07T20:32:58.2025507Z torch.manual_seed(2025) 2025-05-07T20:32:58.2025757Z 2025-05-07T20:32:58.2026032Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.2026393Z 2025-05-07T20:32:58.2026594Z x_sign = torch.sign(x) 2025-05-07T20:32:58.2026888Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.2027208Z x = x_sign * x_clamp 2025-05-07T20:32:58.2027452Z x0 = x[:, :D] 2025-05-07T20:32:58.2027669Z x1 = x[:, D:] 2025-05-07T20:32:58.2027880Z 2025-05-07T20:32:58.2028059Z if contiguous: 2025-05-07T20:32:58.2028285Z x0 = x0.contiguous() 2025-05-07T20:32:58.2028549Z x1 = x1.contiguous() 2025-05-07T20:32:58.2028790Z 2025-05-07T20:32:58.2028974Z if scale_ub is not None: 2025-05-07T20:32:58.2029249Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.2029590Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.2030089Z ) 2025-05-07T20:32:58.2030283Z else: 2025-05-07T20:32:58.2030495Z scale_ub_tensor = None 2025-05-07T20:32:58.2030753Z 2025-05-07T20:32:58.2030979Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.2031302Z op = silu_mul_quant 2025-05-07T20:32:58.2031633Z if compiled: 
2025-05-07T20:32:58.2031877Z op = torch.compile(op) 2025-05-07T20:32:58.2032175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2032455Z 2025-05-07T20:32:58.2032636Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.2032807Z 2025-05-07T20:32:58.2032906Z moe/activation_test.py:117: 2025-05-07T20:32:58.2033203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2033541Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.2033823Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2034561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.2035311Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.2035873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.2036605Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.2037313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.2037872Z kernel = self.compile( 2025-05-07T20:32:58.2038507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.2039257Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.2039672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2039912Z 2025-05-07T20:32:58.2040124Z self = 2025-05-07T20:32:58.2041294Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.2042849Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1ba5a8ee0>} 2025-05-07T20:32:58.2044318Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.2045427Z context = 2025-05-07T20:32:58.2045732Z 2025-05-07T20:32:58.2045900Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.2046445Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.2046934Z module_map=module_map) 2025-05-07T20:32:58.2047302Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.2047660Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.2047927Z E ^ 2025-05-07T20:32:58.2048418Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.2048905Z 2025-05-07T20:32:58.2049353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.2049913Z 2025-05-07T20:32:58.2050013Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.2050440Z self=, 2025-05-07T20:32:58.2050863Z T=4096, 2025-05-07T20:32:58.2051040Z D=5120, 2025-05-07T20:32:58.2051230Z scale_ub=1200.0, 2025-05-07T20:32:58.2051456Z contiguous=True, 2025-05-07T20:32:58.2051672Z compiled=False, 2025-05-07T20:32:58.2051879Z ) 2025-05-07T20:32:58.2052201Z self = 2025-05-07T20:32:58.2052761Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.2053059Z 2025-05-07T20:32:58.2053134Z @given( 2025-05-07T20:32:58.2053362Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.2053675Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.2053989Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.2054331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.2054669Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.2054958Z ) 2025-05-07T20:32:58.2055318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.2055781Z def test_silu_mul_quant( 2025-05-07T20:32:58.2056019Z self, 2025-05-07T20:32:58.2056214Z T: int, 2025-05-07T20:32:58.2056408Z D: int, 2025-05-07T20:32:58.2056620Z scale_ub: Optional[float], 2025-05-07T20:32:58.2056893Z contiguous: bool, 2025-05-07T20:32:58.2057134Z compiled: bool, 2025-05-07T20:32:58.2057355Z ) -> None: 2025-05-07T20:32:58.2057570Z torch.manual_seed(2025) 2025-05-07T20:32:58.2057815Z 2025-05-07T20:32:58.2058081Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.2058434Z 2025-05-07T20:32:58.2058624Z x_sign = torch.sign(x) 2025-05-07T20:32:58.2058958Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.2059277Z x = x_sign * x_clamp 2025-05-07T20:32:58.2059555Z x0 = x[:, :D] 2025-05-07T20:32:58.2059771Z x1 = x[:, D:] 2025-05-07T20:32:58.2059972Z 2025-05-07T20:32:58.2060154Z if contiguous: 2025-05-07T20:32:58.2060387Z x0 = x0.contiguous() 2025-05-07T20:32:58.2060638Z x1 = x1.contiguous() 2025-05-07T20:32:58.2060877Z 2025-05-07T20:32:58.2061111Z if scale_ub is not None: 2025-05-07T20:32:58.2061382Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.2061727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.2062050Z ) 2025-05-07T20:32:58.2062243Z else: 2025-05-07T20:32:58.2062456Z scale_ub_tensor = None 2025-05-07T20:32:58.2062706Z 2025-05-07T20:32:58.2062927Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.2063247Z op = silu_mul_quant 2025-05-07T20:32:58.2063496Z if compiled: 2025-05-07T20:32:58.2063735Z op = torch.compile(op) 2025-05-07T20:32:58.2064038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2064318Z 2025-05-07T20:32:58.2064498Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.2064668Z 2025-05-07T20:32:58.2064765Z moe/activation_test.py:117: 2025-05-07T20:32:58.2065066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2065413Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.2065693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2066434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.2067175Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.2067734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.2068466Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.2069171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.2069735Z kernel = self.compile( 2025-05-07T20:32:58.2070391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.2071092Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.2071505Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2071745Z 2025-05-07T20:32:58.2072007Z self = 2025-05-07T20:32:58.2073166Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.2074667Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1ba1350d0>} 2025-05-07T20:32:58.2076140Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.2077248Z context = 2025-05-07T20:32:58.2077552Z 2025-05-07T20:32:58.2077718Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.2078263Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.2078749Z module_map=module_map) 2025-05-07T20:32:58.2079126Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.2079483Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.2079839Z E ^ 2025-05-07T20:32:58.2080329Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.2080854Z 2025-05-07T20:32:58.2081308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.2081862Z 2025-05-07T20:32:58.2081962Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.2082426Z self=, 2025-05-07T20:32:58.2083009Z T=1, 2025-05-07T20:32:58.2083185Z D=5120, 2025-05-07T20:32:58.2083378Z scale_ub=None, 2025-05-07T20:32:58.2083588Z contiguous=True, 2025-05-07T20:32:58.2083803Z compiled=True, 2025-05-07T20:32:58.2084001Z ) 2025-05-07T20:32:58.8597294Z self = 2025-05-07T20:32:58.8597851Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:58.8598214Z 2025-05-07T20:32:58.8598339Z @given( 2025-05-07T20:32:58.8598695Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.8599149Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.8599592Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.8600031Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.8600470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.8600855Z ) 2025-05-07T20:32:58.8601230Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.8601700Z def test_silu_mul_quant( 2025-05-07T20:32:58.8601947Z self, 2025-05-07T20:32:58.8602139Z T: int, 2025-05-07T20:32:58.8602341Z D: int, 2025-05-07T20:32:58.8602563Z scale_ub: Optional[float], 2025-05-07T20:32:58.8602835Z contiguous: bool, 2025-05-07T20:32:58.8603081Z compiled: bool, 2025-05-07T20:32:58.8603314Z ) -> None: 2025-05-07T20:32:58.8603531Z torch.manual_seed(2025) 2025-05-07T20:32:58.8603782Z 2025-05-07T20:32:58.8604057Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8604410Z 2025-05-07T20:32:58.8604614Z x_sign = torch.sign(x) 2025-05-07T20:32:58.8604935Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.8605281Z x = x_sign * x_clamp 2025-05-07T20:32:58.8605538Z x0 = x[:, :D] 2025-05-07T20:32:58.8605766Z x1 = x[:, D:] 2025-05-07T20:32:58.8605987Z 2025-05-07T20:32:58.8606180Z if contiguous: 2025-05-07T20:32:58.8606428Z x0 = x0.contiguous() 2025-05-07T20:32:58.8606863Z x1 = x1.contiguous() 2025-05-07T20:32:58.8607105Z 2025-05-07T20:32:58.8607297Z if scale_ub is not None: 2025-05-07T20:32:58.8613098Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.8613495Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.8613830Z ) 2025-05-07T20:32:58.8614028Z else: 2025-05-07T20:32:58.8614248Z scale_ub_tensor = None 2025-05-07T20:32:58.8614513Z 2025-05-07T20:32:58.8614753Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.8615100Z op = silu_mul_quant 2025-05-07T20:32:58.8615364Z if compiled: 2025-05-07T20:32:58.8615639Z op = torch.compile(op) 2025-05-07T20:32:58.8615947Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.8616235Z 2025-05-07T20:32:58.8616432Z y_fp8, y_scale = fn() 2025-05-07T20:32:58.8616718Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:58.8617027Z 2025-05-07T20:32:58.8617266Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.8617612Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:58.8617924Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:58.8618352Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:58.8618726Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.8619107Z 2025-05-07T20:32:58.8619311Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:58.8619513Z 2025-05-07T20:32:58.8619622Z moe/activation_test.py:126: 2025-05-07T20:32:58.8619923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8620271Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:58.8620668Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.8621521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:58.8622335Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:58.8622912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.8623653Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.8624389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:58.8625180Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.8625987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:58.8626796Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.8627576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:58.8628256Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:58.8628896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:58.8629440Z fn() 2025-05-07T20:32:58.8630168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:58.8630844Z self.fn.run( 2025-05-07T20:32:58.8631329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.8631890Z kernel = self.compile( 2025-05-07T20:32:58.8632453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.8633155Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.8633619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8633862Z 2025-05-07T20:32:58.8634074Z self = 2025-05-07T20:32:58.8635245Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.8636756Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1b9dfb4c0>}
2025-05-07T20:32:58.8638228Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:58.8639343Z context = 
2025-05-07T20:32:58.8639821Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:58.8640368Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:58.8640906Z                            module_map=module_map)
2025-05-07T20:32:58.8641324Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.8641686Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:58.8641995Z E       ^
2025-05-07T20:32:58.8642485Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.8643423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.8644125Z Trying example: test_silu_mul_quant( self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True )
[test source identical to the examples above; fn() succeeds, ref_fn() raises the same CompilationError in _kernel_quantize_fp8_row]
2025-05-07T20:32:59.4796476Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True )
[test source identical to the examples above; fn() succeeds, ref_fn() raises the same CompilationError in _kernel_quantize_fp8_row]
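Every failing example above stops at the same point: Triton cannot lower the fp8e4nv (float8_e4m3fn) dtype on this runner's GPU, whose backend only offers fp8e4b15 and fp8e5. That supported-dtype list is what Triton reports on pre-sm_89 parts, so the usual fix is to gate fp8 tests on device capability rather than let Hypothesis replay the same compile failure for every drawn example. A minimal sketch, assuming unittest-style tests; the helper, the class name, and the sm_89 threshold are illustrative assumptions, not FBGEMM's actual guard:

# Sketch: skip fp8e4nv tests on GPUs that cannot compile them.
# Assumption: Triton's fp8e4nv lowering needs compute capability >= 8.9
# (Ada/Hopper); older parts such as sm_86 only get fp8e5/fp8e4b15.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Pure capability query; no kernel launch required.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class Fp8ActivationTests(unittest.TestCase):  # hypothetical test class
    ...

With a guard like this the job would report skips instead of spending minutes re-raising the identical CompilationError once per Hypothesis example.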
2025-05-07T20:33:00.4558095Z Trying example: test_silu_mul_quant( self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True )
[test source identical to the examples above; fn() succeeds, ref_fn() raises the same CompilationError in _kernel_quantize_fp8_row]
2025-05-07T20:33:01.2983106Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True )
2025-05-07T20:33:01.3414956Z W0507 20:33:01.339931 88025 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:01.3422287Z W0507 20:33:01.339931 88025 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:01.3423859Z W0507 20:33:01.339931 88025 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:01.3424956Z W0507 20:33:01.339931 88025 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:01.3426170Z W0507 20:33:01.339931 88025 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
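The recompile warning above is separate from the fp8 failure: every drawn (T, D, contiguous) combination changes the shape or strides of x0, so torch.compile installs new guards until it hits recompile_limit (8) and silently falls back to eager for silu_mul_quant. A hedged sketch of the standard remedies; the limit value is illustrative, and which option FBGEMM would prefer is not established by this log:

# Sketch: three common ways to stop a guard-driven recompile storm.
import torch
import torch._dynamo

# 1. Raise the limit named in the warning (default is 8).
torch._dynamo.config.recompile_limit = 32

# 2. Compile with dynamic shapes so the batch dimension T stays symbolic.
#    op = torch.compile(silu_mul_quant, dynamic=True)

# 3. Mark only the varying dimension dynamic on the concrete inputs.
#    torch._dynamo.mark_dynamic(x0, 0)
#    torch._dynamo.mark_dynamic(x1, 0)

In a test that sweeps sizes, option 2 or 3 is usually preferable, since raising the limit only postpones the eager fallback.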
2025-05-07T20:33:01.4634024Z self = 
2025-05-07T20:33:01.4634687Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[test source identical to the examples above; fn() succeeds, ref_fn() raises the same CompilationError in _kernel_quantize_fp8_row]
2025-05-07T20:33:01.4674801Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True )
2025-05-07T20:33:01.6380583Z self = 
2025-05-07T20:33:01.6381496Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[test source identical to the examples above; this time the compiled fn() itself fails]
2025-05-07T20:33:01.6399233Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:01.6399511Z moe/activation_test.py:117: 
2025-05-07T20:33:01.6399824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:01.6400170Z moe/activation_test.py:115: in fn
2025-05-07T20:33:01.6400466Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:01.6401066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:01.6401657Z     return fn(*args, **kwargs)
2025-05-07T20:33:01.6402364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.6403105Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.6403673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.6404397Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.6405104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.6405671Z kernel = self.compile( 2025-05-07T20:33:01.6406243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.6406937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.6407349Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.6407588Z 2025-05-07T20:33:01.6407855Z self = 2025-05-07T20:33:01.6409033Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.6410538Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b90549d0>} 2025-05-07T20:33:01.6412015Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.6413123Z context = 2025-05-07T20:33:01.6413428Z 2025-05-07T20:33:01.6413607Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.6414155Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.6414646Z module_map=module_map) 2025-05-07T20:33:01.6415023Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.6415429Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.6415692Z E ^ 2025-05-07T20:33:01.6416224Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.6416714Z 2025-05-07T20:33:01.6417169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.6417725Z 2025-05-07T20:33:01.6417871Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.6418299Z self=, 2025-05-07T20:33:01.6418723Z T=1, 2025-05-07T20:33:01.6418907Z D=5120, 2025-05-07T20:33:01.6419097Z scale_ub=None, 2025-05-07T20:33:01.6419325Z contiguous=False, 2025-05-07T20:33:01.6419558Z compiled=True, 2025-05-07T20:33:01.6419762Z ) 2025-05-07T20:33:01.7227011Z self = 2025-05-07T20:33:01.7227873Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:01.7228267Z 2025-05-07T20:33:01.7228388Z @given( 2025-05-07T20:33:01.7228629Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.7228956Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.7229272Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.7229608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.7230016Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.7230319Z ) 2025-05-07T20:33:01.7230671Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.7231146Z def test_silu_mul_quant( 2025-05-07T20:33:01.7231392Z self, 2025-05-07T20:33:01.7231576Z T: int, 2025-05-07T20:33:01.7231774Z D: int, 2025-05-07T20:33:01.7231991Z scale_ub: Optional[float], 2025-05-07T20:33:01.7232259Z contiguous: bool, 2025-05-07T20:33:01.7232496Z compiled: bool, 2025-05-07T20:33:01.7232723Z ) -> None: 2025-05-07T20:33:01.7232928Z torch.manual_seed(2025) 2025-05-07T20:33:01.7233172Z 2025-05-07T20:33:01.7233443Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.7233789Z 2025-05-07T20:33:01.7233974Z x_sign = torch.sign(x) 2025-05-07T20:33:01.7234270Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.7234590Z x = x_sign * x_clamp 2025-05-07T20:33:01.7234833Z x0 = x[:, :D] 2025-05-07T20:33:01.7235054Z x1 = x[:, D:] 2025-05-07T20:33:01.7235258Z 2025-05-07T20:33:01.7235435Z if contiguous: 2025-05-07T20:33:01.7235778Z x0 = x0.contiguous() 2025-05-07T20:33:01.7236041Z x1 = x1.contiguous() 2025-05-07T20:33:01.7236278Z 2025-05-07T20:33:01.7236471Z if scale_ub is not None: 2025-05-07T20:33:01.7236748Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.7237088Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.7237407Z ) 2025-05-07T20:33:01.7237596Z else: 2025-05-07T20:33:01.7237798Z scale_ub_tensor = None 2025-05-07T20:33:01.7238053Z 2025-05-07T20:33:01.7238284Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.7238602Z op = silu_mul_quant 2025-05-07T20:33:01.7238856Z if compiled: 2025-05-07T20:33:01.7239103Z op = torch.compile(op) 2025-05-07T20:33:01.7239409Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.7239686Z 2025-05-07T20:33:01.7239882Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.7240177Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.7240467Z 2025-05-07T20:33:01.7240706Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.7241053Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.7241348Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.7241734Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.7242109Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.7242480Z 2025-05-07T20:33:01.7242682Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:01.7242887Z 2025-05-07T20:33:01.7242984Z moe/activation_test.py:126: 2025-05-07T20:33:01.7243288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.7243691Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.7244024Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.7244875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.7245683Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.7246261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.7246995Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.7247733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.7248499Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.7249301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:01.7250106Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.7250892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.7251572Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.7252217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.7252771Z fn() 2025-05-07T20:33:01.7253305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.7253927Z self.fn.run( 2025-05-07T20:33:01.7254419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.7254986Z kernel = self.compile( 2025-05-07T20:33:01.7255550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.7256247Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.7256710Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.7256955Z 2025-05-07T20:33:01.7257176Z self = 2025-05-07T20:33:01.7258346Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.7259864Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
2025-05-07T20:33:01.7261384Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:01.7262505Z context = 
2025-05-07T20:33:01.7262814Z 
2025-05-07T20:33:01.7262980Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:01.7263527Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:01.7264015Z             module_map=module_map)
2025-05-07T20:33:01.7264465Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:01.7264872Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:01.7265150Z E       ^
2025-05-07T20:33:01.7265640Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:01.7266122Z 
2025-05-07T20:33:01.7266567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:01.7267173Z 

[Hypothesis then retried the remaining sampled parameter combinations (log timestamps 20:33:01.72 through 20:33:03.26); each retry reprints the identical source listing and traceback and fails the same way, so only the per-example parameters and the kernel that fails to compile are kept here:

Trying example: T=1,     D=5120, scale_ub=None,   contiguous=True,  compiled=False  -> _fbgemm_silu_mul_quant
Trying example: T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=True   -> _fbgemm_silu_mul_quant
Trying example: T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=False  -> _fbgemm_silu_mul_quant
Trying example: T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=False  -> _fbgemm_silu_mul_quant
Trying example: T=128,   D=5120, scale_ub=1200.0, contiguous=True,  compiled=False  -> _fbgemm_silu_mul_quant
Trying example: T=1,     D=7168, scale_ub=1200.0, contiguous=True,  compiled=True   -> _fbgemm_silu_mul_quant
Trying example: T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=True   -> _fbgemm_silu_mul_quant
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=True   -> _kernel_quantize_fp8_row (in ref_fn)
Trying example: T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=True   -> _fbgemm_silu_mul_quant
Trying example: T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=False  -> _fbgemm_silu_mul_quant
Trying example: T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True   -> _fbgemm_silu_mul_quant

Every retry ends with:
E   triton.compiler.errors.CompilationError: at 1:0:
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
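Triage note for readers of this log: fp8e4nv is Triton's name for torch.float8_e4m3fn, and in the Triton build used here its codegen is only available on GPUs of compute capability 8.9 or newer (Ada/Hopper). This job's runner type, linux.g5.4xlarge.nvidia.gpu, carries an NVIDIA A10G (SM 8.6), where Triton exposes only fp8e4b15 and fp8e5; any kernel that touches the fp8e4nv dtype therefore fails in ast_to_ttir before launch, independent of T, D, scale_ub, contiguous, or compiled. A minimal sketch of a capability gate that would skip these examples on pre-SM-8.9 runners follows; the helper name and the skip wiring are illustrative, not FBGEMM's actual test API:

import unittest

import torch


def sm89_or_newer() -> bool:
    # Triton lowers fp8e4nv (torch.float8_e4m3fn) only on SM 8.9+ (Ada/Hopper);
    # the A10G on linux.g5.4xlarge reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(sm89_or_newer(), "fp8e4nv requires compute capability >= 8.9")
class Fp8MoeActivationTest(unittest.TestCase):
    def test_capability_gate(self) -> None:
        # Placeholder body: on a pre-SM-8.9 runner this class is skipped wholesale.
        self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))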
y_scale_ref = ref_fn() 2025-05-07T20:33:02.9294256Z 2025-05-07T20:33:02.9294358Z moe/activation_test.py:126: 2025-05-07T20:33:02.9294664Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.9295015Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:02.9295351Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:02.9296204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:02.9297013Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:02.9297591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.9298320Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.9299059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:02.9299825Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:02.9300631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:02.9301446Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:02.9302332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:02.9303014Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:02.9303657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:02.9304207Z fn() 2025-05-07T20:33:02.9304737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:02.9305363Z self.fn.run( 2025-05-07T20:33:02.9305859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.9306424Z kernel = self.compile( 2025-05-07T20:33:02.9306993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.9307699Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.9308121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.9308364Z 2025-05-07T20:33:02.9308583Z self = 2025-05-07T20:33:02.9309935Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.9311487Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1b8a4b160>} 2025-05-07T20:33:02.9313006Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.9314161Z context = 2025-05-07T20:33:02.9314463Z 2025-05-07T20:33:02.9314633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.9315183Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.9315671Z module_map=module_map) 2025-05-07T20:33:02.9316051Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.9316415Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:02.9316696Z E ^ 2025-05-07T20:33:02.9317182Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.9317669Z 2025-05-07T20:33:02.9318115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.9318678Z 2025-05-07T20:33:02.9318778Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.9319200Z self=, 2025-05-07T20:33:02.9319616Z T=1, 2025-05-07T20:33:02.9319789Z D=5120, 2025-05-07T20:33:02.9319979Z scale_ub=1200.0, 2025-05-07T20:33:02.9320198Z contiguous=False, 2025-05-07T20:33:02.9320418Z compiled=True, 2025-05-07T20:33:02.9320620Z ) 2025-05-07T20:33:03.1303575Z self = 2025-05-07T20:33:03.1304388Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:03.1304791Z 2025-05-07T20:33:03.1304898Z @given( 2025-05-07T20:33:03.1305211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.1305597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.1305912Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.1306262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.1306595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.1306887Z ) 2025-05-07T20:33:03.1307399Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.1308050Z def test_silu_mul_quant( 2025-05-07T20:33:03.1308290Z self, 2025-05-07T20:33:03.1308480Z T: int, 2025-05-07T20:33:03.1308674Z D: int, 2025-05-07T20:33:03.1308885Z scale_ub: Optional[float], 2025-05-07T20:33:03.1309156Z contiguous: bool, 2025-05-07T20:33:03.1309396Z compiled: bool, 2025-05-07T20:33:03.1309617Z ) -> None: 2025-05-07T20:33:03.1309880Z torch.manual_seed(2025) 2025-05-07T20:33:03.1310129Z 2025-05-07T20:33:03.1310397Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.1310754Z 2025-05-07T20:33:03.1310947Z x_sign = torch.sign(x) 2025-05-07T20:33:03.1311233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.1311575Z x = x_sign * x_clamp 2025-05-07T20:33:03.1311838Z x0 = x[:, :D] 2025-05-07T20:33:03.1312045Z x1 = x[:, D:] 2025-05-07T20:33:03.1312254Z 2025-05-07T20:33:03.1312437Z if contiguous: 2025-05-07T20:33:03.1312664Z x0 = x0.contiguous() 2025-05-07T20:33:03.1312924Z x1 = x1.contiguous() 2025-05-07T20:33:03.1313164Z 2025-05-07T20:33:03.1313353Z if scale_ub is not None: 2025-05-07T20:33:03.1313694Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.1314070Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.1314475Z ) 2025-05-07T20:33:03.1314668Z else: 2025-05-07T20:33:03.1314891Z scale_ub_tensor = None 2025-05-07T20:33:03.1315162Z 2025-05-07T20:33:03.1315402Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.1315752Z op = silu_mul_quant 2025-05-07T20:33:03.1316086Z if compiled: 
2025-05-07T20:33:03.1316346Z op = torch.compile(op) 2025-05-07T20:33:03.1316675Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.1316982Z 2025-05-07T20:33:03.1317180Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.1317367Z 2025-05-07T20:33:03.1317469Z moe/activation_test.py:117: 2025-05-07T20:33:03.1317797Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.1318176Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.1318481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.1319145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.1319826Z return fn(*args, **kwargs) 2025-05-07T20:33:03.1320617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.1321457Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.1322108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.1322985Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.1323779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.1324416Z kernel = self.compile( 2025-05-07T20:33:03.1325062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.1325846Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.1326312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.1326585Z 2025-05-07T20:33:03.1326825Z self = 2025-05-07T20:33:03.1328176Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.1329994Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8a4bb80>} 2025-05-07T20:33:03.1331693Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.1332971Z context = 2025-05-07T20:33:03.1333320Z 2025-05-07T20:33:03.1333504Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.1334124Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.1334673Z module_map=module_map) 2025-05-07T20:33:03.1335052Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.1335411Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.1335665Z E ^ 2025-05-07T20:33:03.1336158Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.1336647Z 2025-05-07T20:33:03.1337091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.1337687Z 2025-05-07T20:33:03.1337795Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.1338252Z self=, 2025-05-07T20:33:03.1338669Z T=1, 2025-05-07T20:33:03.1338848Z D=5120, 2025-05-07T20:33:03.1339033Z scale_ub=1200.0, 2025-05-07T20:33:03.1339253Z contiguous=False, 2025-05-07T20:33:03.1339480Z compiled=False, 2025-05-07T20:33:03.1339743Z ) 2025-05-07T20:33:03.1340064Z self = 2025-05-07T20:33:03.1340574Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:03.1340856Z 2025-05-07T20:33:03.1340936Z @given( 2025-05-07T20:33:03.1341159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.1341472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.1341782Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.1342115Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.1342445Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.1342739Z ) 2025-05-07T20:33:03.1343092Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.1343559Z def test_silu_mul_quant( 2025-05-07T20:33:03.1343799Z self, 2025-05-07T20:33:03.1343983Z T: int, 2025-05-07T20:33:03.1344178Z D: int, 2025-05-07T20:33:03.1344402Z scale_ub: Optional[float], 2025-05-07T20:33:03.1344674Z contiguous: bool, 2025-05-07T20:33:03.1344908Z compiled: bool, 2025-05-07T20:33:03.1345127Z ) -> None: 2025-05-07T20:33:03.1345341Z torch.manual_seed(2025) 2025-05-07T20:33:03.1345576Z 2025-05-07T20:33:03.1345849Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.1346203Z 2025-05-07T20:33:03.1346386Z x_sign = torch.sign(x) 2025-05-07T20:33:03.1346676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.1346991Z x = x_sign * x_clamp 2025-05-07T20:33:03.1347229Z x0 = x[:, :D] 2025-05-07T20:33:03.1347439Z x1 = x[:, D:] 2025-05-07T20:33:03.1347646Z 2025-05-07T20:33:03.1347819Z if contiguous: 2025-05-07T20:33:03.1348046Z x0 = x0.contiguous() 2025-05-07T20:33:03.1348305Z x1 = x1.contiguous() 2025-05-07T20:33:03.1348539Z 2025-05-07T20:33:03.1348731Z if scale_ub is not None: 2025-05-07T20:33:03.1349006Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.1349339Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.1349654Z ) 2025-05-07T20:33:03.1349983Z else: 2025-05-07T20:33:03.1350190Z scale_ub_tensor = None 2025-05-07T20:33:03.1350437Z 2025-05-07T20:33:03.1350661Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.1350979Z op = silu_mul_quant 2025-05-07T20:33:03.1351222Z if compiled: 2025-05-07T20:33:03.1351471Z op = torch.compile(op) 2025-05-07T20:33:03.1351777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.1352054Z 2025-05-07T20:33:03.1352243Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.1352409Z 2025-05-07T20:33:03.1352511Z moe/activation_test.py:117: 2025-05-07T20:33:03.1352828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.1353165Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.1353453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.1354194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.1354931Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.1355491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.1356264Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.1356965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.1357561Z kernel = self.compile( 2025-05-07T20:33:03.1358130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.1358829Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.1359279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.1359518Z 2025-05-07T20:33:03.1359732Z self = 2025-05-07T20:33:03.1360898Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.1362405Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8b57550>} 2025-05-07T20:33:03.1363877Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.1364974Z context = 2025-05-07T20:33:03.1365289Z 2025-05-07T20:33:03.1365459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.1372244Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.1372755Z module_map=module_map) 2025-05-07T20:33:03.1373134Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.1373493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.1373760Z E ^ 2025-05-07T20:33:03.1374263Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.1374757Z 
2025-05-07T20:33:03.1375209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:03.1375772Z 
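Every failure above has the same root cause: the _fbgemm_silu_mul_quant Triton kernel requests the fp8e4nv element type (PyTorch's torch.float8_e4m3fn), which Triton's NVIDIA backend can generally only lower on GPUs of compute capability 8.9 and newer; on this machine's GPU the backend offers only fp8e4b15 and fp8e5, exactly as the ValueError reports. Below is a minimal sketch of a capability gate that would skip such tests on unsupported hardware instead of failing them; the helper, the suite name, and the 8.9 threshold are illustrative assumptions, not FBGEMM's actual test scaffolding.

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (float8_e4m3fn) lowering needs NVIDIA
        # compute capability >= 8.9; older GPUs expose only fp8e4b15/fp8e5,
        # matching the ValueError in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(
        _supports_fp8e4nv(),
        "Triton fp8e4nv is unavailable on this GPU (only fp8e4b15/fp8e5)",
    )
    class SiluMulQuantTests(unittest.TestCase):  # hypothetical suite name
        ...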
2025-05-07T20:33:03.2567503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.2568245Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.2568817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.2569562Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.2570347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.2570921Z kernel = self.compile( 2025-05-07T20:33:03.2571499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.2572203Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.2572611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.2572864Z 2025-05-07T20:33:03.2573081Z self = 2025-05-07T20:33:03.2574255Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.2575777Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b84dd1f0>} 2025-05-07T20:33:03.2577255Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.2578401Z context = 2025-05-07T20:33:03.2578711Z 2025-05-07T20:33:03.2578879Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.2579465Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.2579957Z module_map=module_map) 2025-05-07T20:33:03.2580329Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.2580736Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.2581008Z E ^ 2025-05-07T20:33:03.2581502Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.2582044Z 2025-05-07T20:33:03.2582498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.2583302Z 2025-05-07T20:33:03.2583407Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.2583842Z self=, 2025-05-07T20:33:03.2584265Z T=2048, 2025-05-07T20:33:03.2584459Z D=7168, 2025-05-07T20:33:03.2584649Z scale_ub=1200.0, 2025-05-07T20:33:03.2584874Z contiguous=False, 2025-05-07T20:33:03.2585099Z compiled=True, 2025-05-07T20:33:03.2585302Z ) 2025-05-07T20:33:03.2585627Z self = 2025-05-07T20:33:03.2586154Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:03.2586443Z 2025-05-07T20:33:03.2586530Z @given( 2025-05-07T20:33:03.2586766Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.2587092Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.2587413Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.2587754Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.2588081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.2588372Z ) 2025-05-07T20:33:03.2588729Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.2589188Z def test_silu_mul_quant( 2025-05-07T20:33:03.2589434Z self, 2025-05-07T20:33:03.2589631Z T: int, 2025-05-07T20:33:03.2589972Z D: int, 2025-05-07T20:33:03.2590198Z scale_ub: Optional[float], 2025-05-07T20:33:03.2590479Z contiguous: bool, 2025-05-07T20:33:03.2590723Z compiled: bool, 2025-05-07T20:33:03.2590951Z ) -> None: 2025-05-07T20:33:03.2591169Z torch.manual_seed(2025) 2025-05-07T20:33:03.2591416Z 2025-05-07T20:33:03.2591767Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.2592127Z 2025-05-07T20:33:03.2592316Z x_sign = torch.sign(x) 2025-05-07T20:33:03.2592613Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.2592928Z x = x_sign * x_clamp 2025-05-07T20:33:03.2593176Z x0 = x[:, :D] 2025-05-07T20:33:03.2593391Z x1 = x[:, D:] 2025-05-07T20:33:03.2593606Z 2025-05-07T20:33:03.2593790Z if contiguous: 2025-05-07T20:33:03.2594020Z x0 = x0.contiguous() 2025-05-07T20:33:03.2594284Z x1 = x1.contiguous() 2025-05-07T20:33:03.2594535Z 2025-05-07T20:33:03.2594724Z if scale_ub is not None: 2025-05-07T20:33:03.2594997Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.2595337Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.2595654Z ) 2025-05-07T20:33:03.2595841Z else: 2025-05-07T20:33:03.2596052Z scale_ub_tensor = None 2025-05-07T20:33:03.2596308Z 2025-05-07T20:33:03.2596547Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.2596881Z op = silu_mul_quant 2025-05-07T20:33:03.2597139Z if compiled: 2025-05-07T20:33:03.2597387Z op = torch.compile(op) 2025-05-07T20:33:03.2597693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.2598048Z 2025-05-07T20:33:03.2598233Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.2598455Z 2025-05-07T20:33:03.2598555Z moe/activation_test.py:117: 2025-05-07T20:33:03.2598862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.2599206Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.2599493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.2600144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.2600742Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.2601447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.2602195Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.2602769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.2603654Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.2604367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.2604933Z kernel = self.compile( 2025-05-07T20:33:03.2605500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.2606196Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.2606608Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.2606847Z 2025-05-07T20:33:03.2607063Z self = 2025-05-07T20:33:03.2608372Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.2609878Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b84ddee0>} 2025-05-07T20:33:03.2611354Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.2612518Z context = 2025-05-07T20:33:03.2612822Z 2025-05-07T20:33:03.2612995Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.2613597Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.2614094Z module_map=module_map) 2025-05-07T20:33:03.2614470Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.2614829Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.2615085Z E ^ 2025-05-07T20:33:03.2615579Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.2616068Z 2025-05-07T20:33:03.2616519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.2617074Z 2025-05-07T20:33:03.5271885Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5273061Z self=, 2025-05-07T20:33:03.5273800Z T=1, 2025-05-07T20:33:03.5274111Z D=5120, 2025-05-07T20:33:03.5274436Z scale_ub=None, 2025-05-07T20:33:03.5274793Z contiguous=False, 2025-05-07T20:33:03.5275159Z compiled=False, 2025-05-07T20:33:03.5275496Z ) 2025-05-07T20:33:03.5276018Z self = 2025-05-07T20:33:03.5277190Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:03.5277619Z 2025-05-07T20:33:03.5277874Z @given( 2025-05-07T20:33:03.5278212Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5278720Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5279263Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5279827Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5280515Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5280965Z ) 2025-05-07T20:33:03.5281602Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5282399Z def test_silu_mul_quant( 2025-05-07T20:33:03.5283120Z self, 2025-05-07T20:33:03.5283447Z T: int, 2025-05-07T20:33:03.5283767Z D: int, 2025-05-07T20:33:03.5284112Z scale_ub: Optional[float], 2025-05-07T20:33:03.5284558Z contiguous: bool, 2025-05-07T20:33:03.5284946Z compiled: bool, 2025-05-07T20:33:03.5285304Z ) -> None: 2025-05-07T20:33:03.5285659Z torch.manual_seed(2025) 2025-05-07T20:33:03.5286074Z 2025-05-07T20:33:03.5286522Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5287113Z 2025-05-07T20:33:03.5287433Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5287919Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5288426Z x = x_sign * x_clamp 2025-05-07T20:33:03.5288834Z x0 = x[:, :D] 2025-05-07T20:33:03.5289194Z x1 = x[:, D:] 2025-05-07T20:33:03.5289533Z 2025-05-07T20:33:03.5289841Z if contiguous: 2025-05-07T20:33:03.5290227Z x0 = x0.contiguous() 2025-05-07T20:33:03.5290655Z x1 = x1.contiguous() 2025-05-07T20:33:03.5291058Z 2025-05-07T20:33:03.5291375Z if scale_ub is not None: 2025-05-07T20:33:03.5291823Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5292388Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5292911Z ) 2025-05-07T20:33:03.5293223Z else: 2025-05-07T20:33:03.5293582Z scale_ub_tensor = None 2025-05-07T20:33:03.5294025Z 2025-05-07T20:33:03.5294399Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5294940Z op = silu_mul_quant 2025-05-07T20:33:03.5295352Z if compiled: 2025-05-07T20:33:03.5295754Z op = torch.compile(op) 2025-05-07T20:33:03.5296243Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5296708Z 2025-05-07T20:33:03.5297022Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5297293Z 2025-05-07T20:33:03.5297606Z moe/activation_test.py:117: 2025-05-07T20:33:03.5298115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5298675Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5299130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5300345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5301576Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5302545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5303710Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5306364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5307271Z kernel = self.compile( 2025-05-07T20:33:03.5308170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5309253Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5310025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5310435Z 2025-05-07T20:33:03.5310897Z self = 2025-05-07T20:33:03.5312797Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5315224Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8ad85e0>} 2025-05-07T20:33:03.5317712Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5319519Z context = 2025-05-07T20:33:03.5319999Z 2025-05-07T20:33:03.5320284Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5321170Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5321985Z module_map=module_map) 2025-05-07T20:33:03.5322597Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5323177Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5323617Z E ^ 2025-05-07T20:33:03.5324418Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5325224Z 2025-05-07T20:33:03.5325971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5326859Z 2025-05-07T20:33:03.5327027Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5327729Z self=, 2025-05-07T20:33:03.5328417Z T=4096, 2025-05-07T20:33:03.5328716Z D=7168, 2025-05-07T20:33:03.5329034Z scale_ub=1200.0, 2025-05-07T20:33:03.5329398Z contiguous=False, 2025-05-07T20:33:03.5329760Z compiled=False, 2025-05-07T20:33:03.5330101Z ) 2025-05-07T20:33:03.5330631Z self = 2025-05-07T20:33:03.5331481Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:03.5331988Z 2025-05-07T20:33:03.5332135Z @given( 2025-05-07T20:33:03.5332513Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5333024Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5333517Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5334163Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5334726Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5335197Z ) 2025-05-07T20:33:03.5335786Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5336552Z def test_silu_mul_quant( 2025-05-07T20:33:03.5336959Z self, 2025-05-07T20:33:03.5337270Z T: int, 2025-05-07T20:33:03.5337595Z D: int, 2025-05-07T20:33:03.5337950Z scale_ub: Optional[float], 2025-05-07T20:33:03.5338387Z contiguous: bool, 2025-05-07T20:33:03.5338780Z compiled: bool, 2025-05-07T20:33:03.5339133Z ) -> None: 2025-05-07T20:33:03.5339472Z torch.manual_seed(2025) 2025-05-07T20:33:03.5339870Z 2025-05-07T20:33:03.5340310Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5340887Z 2025-05-07T20:33:03.5341198Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5341676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5342184Z x = x_sign * x_clamp 2025-05-07T20:33:03.5342581Z x0 = x[:, :D] 2025-05-07T20:33:03.5342939Z x1 = x[:, D:] 2025-05-07T20:33:03.5343272Z 2025-05-07T20:33:03.5343572Z if contiguous: 2025-05-07T20:33:03.5344074Z x0 = x0.contiguous() 2025-05-07T20:33:03.5344494Z x1 = x1.contiguous() 2025-05-07T20:33:03.5344950Z 2025-05-07T20:33:03.5345259Z if scale_ub is not None: 2025-05-07T20:33:03.5345709Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5346254Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5346752Z ) 2025-05-07T20:33:03.5347060Z else: 2025-05-07T20:33:03.5348087Z scale_ub_tensor = None 2025-05-07T20:33:03.5348504Z 2025-05-07T20:33:03.5348875Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5349392Z op = silu_mul_quant 2025-05-07T20:33:03.5349974Z if compiled: 2025-05-07T20:33:03.5350388Z op = torch.compile(op) 2025-05-07T20:33:03.5350851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5351288Z 2025-05-07T20:33:03.5351579Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5351826Z 2025-05-07T20:33:03.5351979Z moe/activation_test.py:117: 2025-05-07T20:33:03.5352449Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5353005Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5353471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5354636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5355868Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5356797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5357987Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5359146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5360083Z kernel = self.compile( 2025-05-07T20:33:03.5361025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5362158Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5362835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5363222Z 2025-05-07T20:33:03.5363564Z self = 2025-05-07T20:33:03.5365491Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5367923Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8c291f0>} 2025-05-07T20:33:03.5370101Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5371902Z context = 2025-05-07T20:33:03.5372399Z 2025-05-07T20:33:03.5372683Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5373579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5374388Z module_map=module_map) 2025-05-07T20:33:03.5374997Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5375590Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5376012Z E ^ 2025-05-07T20:33:03.5376811Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5377611Z 2025-05-07T20:33:03.5378418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5379326Z 2025-05-07T20:33:03.5379562Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5380244Z self=, 2025-05-07T20:33:03.5380927Z T=16384, 2025-05-07T20:33:03.5381242Z D=7168, 2025-05-07T20:33:03.5381549Z scale_ub=None, 2025-05-07T20:33:03.5381901Z contiguous=True, 2025-05-07T20:33:03.5382404Z compiled=True, 2025-05-07T20:33:03.5383007Z ) 2025-05-07T20:33:03.8233629Z self = 2025-05-07T20:33:03.8234597Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:03.8235096Z 2025-05-07T20:33:03.8235225Z @given( 2025-05-07T20:33:03.8235612Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.8236117Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.8236597Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.8237106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.8237599Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.8238056Z ) 2025-05-07T20:33:03.8238645Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.8239402Z def test_silu_mul_quant( 2025-05-07T20:33:03.8239791Z self, 2025-05-07T20:33:03.8240108Z T: int, 2025-05-07T20:33:03.8240437Z D: int, 2025-05-07T20:33:03.8240779Z scale_ub: Optional[float], 2025-05-07T20:33:03.8241232Z contiguous: bool, 2025-05-07T20:33:03.8241642Z compiled: bool, 2025-05-07T20:33:03.8242048Z ) -> None: 2025-05-07T20:33:03.8242398Z torch.manual_seed(2025) 2025-05-07T20:33:03.8242813Z 2025-05-07T20:33:03.8243250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.8243825Z 2025-05-07T20:33:03.8244139Z x_sign = torch.sign(x) 2025-05-07T20:33:03.8244615Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.8256113Z x = x_sign * x_clamp 2025-05-07T20:33:03.8256531Z x0 = x[:, :D] 2025-05-07T20:33:03.8256882Z x1 = x[:, D:] 2025-05-07T20:33:03.8257227Z 2025-05-07T20:33:03.8257536Z if contiguous: 2025-05-07T20:33:03.8257909Z x0 = x0.contiguous() 2025-05-07T20:33:03.8258345Z x1 = x1.contiguous() 2025-05-07T20:33:03.8258755Z 2025-05-07T20:33:03.8259062Z if scale_ub is not None: 2025-05-07T20:33:03.8259518Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.8260391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.8260908Z ) 2025-05-07T20:33:03.8261223Z else: 2025-05-07T20:33:03.8261574Z scale_ub_tensor = None 2025-05-07T20:33:03.8261990Z 2025-05-07T20:33:03.8262383Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.8262904Z op = silu_mul_quant 2025-05-07T20:33:03.8263320Z if compiled: 2025-05-07T20:33:03.8263728Z op = torch.compile(op) 2025-05-07T20:33:03.8264215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.8264678Z 2025-05-07T20:33:03.8264979Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.8265264Z 2025-05-07T20:33:03.8265428Z moe/activation_test.py:117: 2025-05-07T20:33:03.8265915Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.8266472Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.8266947Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.8267920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.8268890Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.8270220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.8271574Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.8272557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.8273842Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.8275015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.8275935Z kernel = self.compile( 2025-05-07T20:33:03.8276997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.8278120Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.8278799Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.8279198Z 2025-05-07T20:33:03.8279551Z self = 2025-05-07T20:33:03.8281455Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.8284301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8c29ee0>} 2025-05-07T20:33:03.8286713Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.8288539Z context = 2025-05-07T20:33:03.8289040Z 2025-05-07T20:33:03.8289325Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.8290221Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.8291037Z module_map=module_map) 2025-05-07T20:33:03.8291646Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.8292238Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.8292665Z E ^ 2025-05-07T20:33:03.8293472Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.8294275Z 2025-05-07T20:33:03.8295021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.8295930Z 2025-05-07T20:33:03.8296110Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.8296912Z self=, 2025-05-07T20:33:03.8297616Z T=4096, 2025-05-07T20:33:03.8297931Z D=5120, 2025-05-07T20:33:03.8298239Z scale_ub=None, 2025-05-07T20:33:03.8298593Z contiguous=False, 2025-05-07T20:33:03.8298961Z compiled=True, 2025-05-07T20:33:03.8299294Z ) 2025-05-07T20:33:03.8299827Z self = 2025-05-07T20:33:03.8300679Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:03.8301151Z 2025-05-07T20:33:03.8301277Z @given( 2025-05-07T20:33:03.8301657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.8302234Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.8302751Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.8303266Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.8303760Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.8304144Z ) 2025-05-07T20:33:03.8304607Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.8305246Z def test_silu_mul_quant( 2025-05-07T20:33:03.8305592Z self, 2025-05-07T20:33:03.8305838Z T: int, 2025-05-07T20:33:03.8306216Z D: int, 2025-05-07T20:33:03.8306522Z scale_ub: Optional[float], 2025-05-07T20:33:03.8306972Z contiguous: bool, 2025-05-07T20:33:03.8307302Z compiled: bool, 2025-05-07T20:33:03.8307615Z ) -> None: 2025-05-07T20:33:03.8307905Z torch.manual_seed(2025) 2025-05-07T20:33:03.8308261Z 2025-05-07T20:33:03.8308649Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.8309313Z 2025-05-07T20:33:03.8309586Z x_sign = torch.sign(x) 2025-05-07T20:33:03.8310114Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.8310558Z x = x_sign * x_clamp 2025-05-07T20:33:03.8310908Z x0 = x[:, :D] 2025-05-07T20:33:03.8311228Z x1 = x[:, D:] 2025-05-07T20:33:03.8311545Z 2025-05-07T20:33:03.8311809Z if contiguous: 2025-05-07T20:33:03.8312149Z x0 = x0.contiguous() 2025-05-07T20:33:03.8312522Z x1 = x1.contiguous() 2025-05-07T20:33:03.8312874Z 2025-05-07T20:33:03.8313152Z if scale_ub is not None: 2025-05-07T20:33:03.8313553Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.8314066Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.8314503Z ) 2025-05-07T20:33:03.8314792Z else: 2025-05-07T20:33:03.8315111Z scale_ub_tensor = None 2025-05-07T20:33:03.8315518Z 2025-05-07T20:33:03.8315880Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.8316397Z op = silu_mul_quant 2025-05-07T20:33:03.8316787Z if compiled: 2025-05-07T20:33:03.8317176Z op = torch.compile(op) 2025-05-07T20:33:03.8317654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.8318092Z 2025-05-07T20:33:03.8318393Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.8318653Z 2025-05-07T20:33:03.8318812Z moe/activation_test.py:117: 2025-05-07T20:33:03.8319278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.8319822Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.8320271Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.8321189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.8322129Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.8323246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.8324375Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.8325332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.8326453Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.8327542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.8328444Z kernel = self.compile( 2025-05-07T20:33:03.8329362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.8330464Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.8331143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.8331543Z 2025-05-07T20:33:03.8331873Z self = 2025-05-07T20:33:03.8333729Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.8336134Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8987940>} 2025-05-07T20:33:03.8338604Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.8340450Z context = 2025-05-07T20:33:03.8340947Z 2025-05-07T20:33:03.8341202Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.8342040Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.8342919Z module_map=module_map) 2025-05-07T20:33:03.8343507Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.8344092Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.8344520Z E ^ 2025-05-07T20:33:03.8345310Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.8346097Z 2025-05-07T20:33:03.8346819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.8347722Z 2025-05-07T20:33:04.0277783Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.0278604Z self=, 2025-05-07T20:33:04.0279293Z T=4096, 2025-05-07T20:33:04.0279598Z D=5120, 2025-05-07T20:33:04.0279908Z scale_ub=1200.0, 2025-05-07T20:33:04.0280292Z contiguous=False, 2025-05-07T20:33:04.0280646Z compiled=False, 2025-05-07T20:33:04.0280971Z ) 2025-05-07T20:33:04.0281445Z self = 2025-05-07T20:33:04.0282283Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:04.0283066Z 2025-05-07T20:33:04.0283223Z @given( 2025-05-07T20:33:04.0283589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.0284117Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.0284638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.0285199Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.0285739Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.0286221Z ) 2025-05-07T20:33:04.0286804Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.0287556Z def test_silu_mul_quant( 2025-05-07T20:33:04.0287959Z self, 2025-05-07T20:33:04.0288274Z T: int, 2025-05-07T20:33:04.0288584Z D: int, 2025-05-07T20:33:04.0288937Z scale_ub: Optional[float], 2025-05-07T20:33:04.0289386Z contiguous: bool, 2025-05-07T20:33:04.0290077Z compiled: bool, 2025-05-07T20:33:04.0290458Z ) -> None: 2025-05-07T20:33:04.0290807Z torch.manual_seed(2025) 2025-05-07T20:33:04.0291201Z 2025-05-07T20:33:04.0291645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.0292229Z 2025-05-07T20:33:04.0292543Z x_sign = torch.sign(x) 2025-05-07T20:33:04.0293034Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.0293572Z x = x_sign * x_clamp 2025-05-07T20:33:04.0293967Z x0 = x[:, :D] 2025-05-07T20:33:04.0294318Z x1 = x[:, D:] 2025-05-07T20:33:04.0294656Z 2025-05-07T20:33:04.0294958Z if contiguous: 2025-05-07T20:33:04.0295325Z x0 = x0.contiguous() 2025-05-07T20:33:04.0295756Z x1 = x1.contiguous() 2025-05-07T20:33:04.0296161Z 2025-05-07T20:33:04.0296498Z if scale_ub is not None: 2025-05-07T20:33:04.0296940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.0297497Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.0298014Z ) 2025-05-07T20:33:04.0298315Z else: 2025-05-07T20:33:04.0298656Z scale_ub_tensor = None 2025-05-07T20:33:04.0299070Z 2025-05-07T20:33:04.0299439Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.0300086Z op = silu_mul_quant 2025-05-07T20:33:04.0300623Z if compiled: 2025-05-07T20:33:04.0301017Z op = torch.compile(op) 2025-05-07T20:33:04.0301493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0301947Z 2025-05-07T20:33:04.0302253Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.0302520Z 2025-05-07T20:33:04.0302679Z moe/activation_test.py:117: 2025-05-07T20:33:04.0303288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0303850Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.0304299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0305473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.0306665Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.0307580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.0308742Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.0310049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.0310963Z kernel = self.compile( 2025-05-07T20:33:04.0311904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.0313063Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.0313740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0314142Z 2025-05-07T20:33:04.0314490Z self = 2025-05-07T20:33:04.0316384Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.0318860Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b872f3a0>} 2025-05-07T20:33:04.0321254Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.0323061Z context = 2025-05-07T20:33:04.0323560Z 2025-05-07T20:33:04.0323932Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.0324831Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.0325636Z module_map=module_map) 2025-05-07T20:33:04.0326232Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.0326814Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.0327255Z E ^ 2025-05-07T20:33:04.0328055Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.0328861Z 2025-05-07T20:33:04.0329599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.0330514Z 2025-05-07T20:33:04.0330682Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.0331383Z self=, 2025-05-07T20:33:04.0332079Z T=4096, 2025-05-07T20:33:04.0332391Z D=5120, 2025-05-07T20:33:04.0332698Z scale_ub=1200.0, 2025-05-07T20:33:04.0333067Z contiguous=False, 2025-05-07T20:33:04.0333435Z compiled=True, 2025-05-07T20:33:04.0333762Z ) 2025-05-07T20:33:04.0334290Z self = 2025-05-07T20:33:04.0335241Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:04.0335767Z 2025-05-07T20:33:04.0335892Z @given( 2025-05-07T20:33:04.0336260Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.0336790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.0337294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.0337845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.0338465Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.0338949Z ) 2025-05-07T20:33:04.0339535Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.0340300Z def test_silu_mul_quant( 2025-05-07T20:33:04.0340701Z self, 2025-05-07T20:33:04.0341012Z T: int, 2025-05-07T20:33:04.0341333Z D: int, 2025-05-07T20:33:04.0341685Z scale_ub: Optional[float], 2025-05-07T20:33:04.0342127Z contiguous: bool, 2025-05-07T20:33:04.0342526Z compiled: bool, 2025-05-07T20:33:04.0342893Z ) -> None: 2025-05-07T20:33:04.0343237Z torch.manual_seed(2025) 2025-05-07T20:33:04.0343640Z 2025-05-07T20:33:04.0344085Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.0344661Z 2025-05-07T20:33:04.0344982Z x_sign = torch.sign(x) 2025-05-07T20:33:04.0345460Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.0345977Z x = x_sign * x_clamp 2025-05-07T20:33:04.0346372Z x0 = x[:, :D] 2025-05-07T20:33:04.0346729Z x1 = x[:, D:] 2025-05-07T20:33:04.0347072Z 2025-05-07T20:33:04.0347368Z if contiguous: 2025-05-07T20:33:04.0347750Z x0 = x0.contiguous() 2025-05-07T20:33:04.0348171Z x1 = x1.contiguous() 2025-05-07T20:33:04.0348550Z 2025-05-07T20:33:04.0348809Z if scale_ub is not None: 2025-05-07T20:33:04.0349170Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.0349604Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.0350218Z ) 2025-05-07T20:33:04.0350499Z else: 2025-05-07T20:33:04.0350773Z scale_ub_tensor = None 2025-05-07T20:33:04.0351126Z 2025-05-07T20:33:04.0351437Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.0351877Z op = silu_mul_quant 2025-05-07T20:33:04.0352240Z if compiled: 2025-05-07T20:33:04.0352585Z op = torch.compile(op) 2025-05-07T20:33:04.0352982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0353367Z 2025-05-07T20:33:04.0353713Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.0353956Z 2025-05-07T20:33:04.0354101Z moe/activation_test.py:117: 2025-05-07T20:33:04.0354546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0355042Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.0355445Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0356290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:04.0357154Z return fn(*args, **kwargs) 
2025-05-07T20:33:04.0358183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.0359358Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.0360243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.0361394Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.0362507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.0363337Z kernel = self.compile( 2025-05-07T20:33:04.0364222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.0365390Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.0366095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0366480Z 2025-05-07T20:33:04.0366816Z self = 2025-05-07T20:33:04.0368662Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.0371188Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b872f280>} 2025-05-07T20:33:04.0373524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.0375295Z context = 2025-05-07T20:33:04.0375808Z 2025-05-07T20:33:04.0376084Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.0376978Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.0377773Z module_map=module_map) 2025-05-07T20:33:04.0378363Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.0378945Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.0379381Z E ^ 2025-05-07T20:33:04.0380176Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.0380950Z 2025-05-07T20:33:04.0381687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.0382641Z 2025-05-07T20:33:04.3116759Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.3117982Z self=, 2025-05-07T20:33:04.3118742Z T=2048, 2025-05-07T20:33:04.3119054Z D=7168, 2025-05-07T20:33:04.3119358Z scale_ub=1200.0, 2025-05-07T20:33:04.3119714Z contiguous=False, 2025-05-07T20:33:04.3120075Z compiled=False, 2025-05-07T20:33:04.3120408Z ) 2025-05-07T20:33:04.3120937Z self = 2025-05-07T20:33:04.3121796Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:04.3122244Z 2025-05-07T20:33:04.3122669Z @given( 2025-05-07T20:33:04.3123002Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.3123446Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.3123895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.3124394Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.3124894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.3125349Z ) 2025-05-07T20:33:04.3125906Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.3126597Z def test_silu_mul_quant( 2025-05-07T20:33:04.3126954Z self, 2025-05-07T20:33:04.3127242Z T: int, 2025-05-07T20:33:04.3127526Z D: int, 2025-05-07T20:33:04.3127850Z scale_ub: Optional[float], 2025-05-07T20:33:04.3128292Z contiguous: bool, 2025-05-07T20:33:04.3128663Z compiled: bool, 2025-05-07T20:33:04.3129012Z ) -> None: 2025-05-07T20:33:04.3129359Z torch.manual_seed(2025) 2025-05-07T20:33:04.3129737Z 2025-05-07T20:33:04.3130159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.3130747Z 2025-05-07T20:33:04.3131066Z x_sign = torch.sign(x) 2025-05-07T20:33:04.3131526Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.3132231Z x = x_sign * x_clamp 2025-05-07T20:33:04.3132641Z x0 = x[:, :D] 2025-05-07T20:33:04.3133108Z x1 = x[:, D:] 2025-05-07T20:33:04.3133440Z 2025-05-07T20:33:04.3133738Z if contiguous: 2025-05-07T20:33:04.3134096Z x0 = x0.contiguous() 2025-05-07T20:33:04.3134516Z x1 = x1.contiguous() 2025-05-07T20:33:04.3134912Z 2025-05-07T20:33:04.3135216Z if scale_ub is not None: 2025-05-07T20:33:04.3135795Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.3136350Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.3136864Z ) 2025-05-07T20:33:04.3137168Z else: 2025-05-07T20:33:04.3137521Z scale_ub_tensor = None 2025-05-07T20:33:04.3137936Z 2025-05-07T20:33:04.3138303Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.3138825Z op = silu_mul_quant 2025-05-07T20:33:04.3139232Z if compiled: 2025-05-07T20:33:04.3139628Z op = torch.compile(op) 2025-05-07T20:33:04.3140116Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.3140577Z 2025-05-07T20:33:04.3140878Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.3141161Z 2025-05-07T20:33:04.3141325Z moe/activation_test.py:117: 2025-05-07T20:33:04.3141811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.3142360Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.3142835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.3144038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.3145237Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.3146153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.3147343Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.3148503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.3149446Z kernel = self.compile( 2025-05-07T20:33:04.3150565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.3151735Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.3152473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.3152864Z 2025-05-07T20:33:04.3153207Z self = 2025-05-07T20:33:04.3155186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.3157573Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b856d670>} 2025-05-07T20:33:04.3159951Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.3171521Z context = 2025-05-07T20:33:04.3172071Z 2025-05-07T20:33:04.3172354Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.3173261Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.3174062Z module_map=module_map) 2025-05-07T20:33:04.3174666Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.3175242Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.3175676Z E ^ 2025-05-07T20:33:04.3176616Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.3177482Z 2025-05-07T20:33:04.3178212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.3179126Z 2025-05-07T20:33:04.3179294Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.3179978Z self=, 2025-05-07T20:33:04.3180726Z T=1, 2025-05-07T20:33:04.3181019Z D=7168, 2025-05-07T20:33:04.3181333Z scale_ub=None, 2025-05-07T20:33:04.3181681Z contiguous=True, 2025-05-07T20:33:04.3182039Z compiled=False, 2025-05-07T20:33:04.3182374Z ) 2025-05-07T20:33:04.3183236Z self = 2025-05-07T20:33:04.3184057Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:04.3184508Z 2025-05-07T20:33:04.3184636Z @given( 2025-05-07T20:33:04.3185006Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.3185528Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.3186027Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.3186571Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.3187101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.3187560Z ) 2025-05-07T20:33:04.3188156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.3188916Z def test_silu_mul_quant( 2025-05-07T20:33:04.3189303Z self, 2025-05-07T20:33:04.3189621Z T: int, 2025-05-07T20:33:04.3190061Z D: int, 2025-05-07T20:33:04.3190401Z scale_ub: Optional[float], 2025-05-07T20:33:04.3190853Z contiguous: bool, 2025-05-07T20:33:04.3191246Z compiled: bool, 2025-05-07T20:33:04.3191592Z ) -> None: 2025-05-07T20:33:04.3191938Z torch.manual_seed(2025) 2025-05-07T20:33:04.3192330Z 2025-05-07T20:33:04.3192768Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.3193350Z 2025-05-07T20:33:04.3193648Z x_sign = torch.sign(x) 2025-05-07T20:33:04.3194118Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.3194618Z x = x_sign * x_clamp 2025-05-07T20:33:04.3194999Z x0 = x[:, :D] 2025-05-07T20:33:04.3195337Z x1 = x[:, D:] 2025-05-07T20:33:04.3195662Z 2025-05-07T20:33:04.3195944Z if contiguous: 2025-05-07T20:33:04.3196315Z x0 = x0.contiguous() 2025-05-07T20:33:04.3196732Z x1 = x1.contiguous() 2025-05-07T20:33:04.3197250Z 2025-05-07T20:33:04.3197567Z if scale_ub is not None: 2025-05-07T20:33:04.3198014Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.3198567Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.3199068Z ) 2025-05-07T20:33:04.3199374Z else: 2025-05-07T20:33:04.3199707Z scale_ub_tensor = None 2025-05-07T20:33:04.3200091Z 2025-05-07T20:33:04.3200453Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.3200976Z op = silu_mul_quant 2025-05-07T20:33:04.3201381Z if compiled: 2025-05-07T20:33:04.3201779Z op = torch.compile(op) 2025-05-07T20:33:04.3202260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.3202708Z 2025-05-07T20:33:04.3203015Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.3203290Z 2025-05-07T20:33:04.3203459Z moe/activation_test.py:117: 2025-05-07T20:33:04.3203943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.3204501Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.3204965Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.3206270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.3207485Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:33:04.3231319Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)

self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
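The test body above is self-describing: silu_mul_quant fuses a SiLU gate, an elementwise multiply, and quantization to FP8, returning the quantized tensor and its scale. As a reading aid, a minimal eager-mode sketch of those semantics follows; the rowwise-amax scaling, the clamp epsilon, and the torch.float8_e4m3fn target are illustrative assumptions, not the kernel's confirmed scheme.

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Sketch: SiLU(x0) * x1 in fp32, then rowwise FP8 (e4m3) quantization.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            # Optional cap on the rowwise max, mirroring scale_ub_tensor in the test.
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / torch.finfo(torch.float8_e4m3fn).max
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Against that reference, the @given grid (five T values, two D values, scale_ub on/off, contiguity on/off, torch.compile on/off) exercises layout and compilation paths rather than numerics, which is why every combination below dies at the same spot: the kernel's cast to fp8e4nv fails before any arithmetic runs.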
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.5191569Z 2025-05-07T20:33:04.5192286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.5193210Z 2025-05-07T20:33:04.5193377Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.5194068Z self=, 2025-05-07T20:33:04.5194738Z T=1, 2025-05-07T20:33:04.5195049Z D=7168, 2025-05-07T20:33:04.5195356Z scale_ub=None, 2025-05-07T20:33:04.5195703Z contiguous=False, 2025-05-07T20:33:04.5196078Z compiled=False, 2025-05-07T20:33:04.5196402Z ) 2025-05-07T20:33:04.5196923Z self = 2025-05-07T20:33:04.5197753Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:04.5198181Z 2025-05-07T20:33:04.5198308Z @given( 2025-05-07T20:33:04.5198667Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.5199185Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.5199687Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.5200225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.5200774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.5201252Z ) 2025-05-07T20:33:04.5201829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.5202580Z def test_silu_mul_quant( 2025-05-07T20:33:04.5202976Z self, 2025-05-07T20:33:04.5203275Z T: int, 2025-05-07T20:33:04.5203586Z D: int, 2025-05-07T20:33:04.5203933Z scale_ub: Optional[float], 2025-05-07T20:33:04.5204519Z contiguous: bool, 2025-05-07T20:33:04.5204899Z compiled: bool, 2025-05-07T20:33:04.5205255Z ) -> None: 2025-05-07T20:33:04.5205593Z torch.manual_seed(2025) 2025-05-07T20:33:04.5205976Z 2025-05-07T20:33:04.5206415Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.5206996Z 2025-05-07T20:33:04.5207290Z x_sign = torch.sign(x) 2025-05-07T20:33:04.5207763Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.5208279Z x = x_sign * x_clamp 2025-05-07T20:33:04.5208656Z x0 = x[:, :D] 2025-05-07T20:33:04.5209001Z x1 = x[:, D:] 2025-05-07T20:33:04.5209333Z 2025-05-07T20:33:04.5209620Z if contiguous: 2025-05-07T20:33:04.5209992Z x0 = x0.contiguous() 2025-05-07T20:33:04.5210417Z x1 = x1.contiguous() 2025-05-07T20:33:04.5210802Z 2025-05-07T20:33:04.5211096Z if scale_ub is not None: 2025-05-07T20:33:04.5211540Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.5212075Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.5212585Z ) 2025-05-07T20:33:04.5212891Z else: 2025-05-07T20:33:04.5213222Z scale_ub_tensor = None 2025-05-07T20:33:04.5213626Z 2025-05-07T20:33:04.5214070Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.5214600Z op = silu_mul_quant 2025-05-07T20:33:04.5215053Z if compiled: 2025-05-07T20:33:04.5215448Z op = torch.compile(op) 2025-05-07T20:33:04.5215926Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.5216368Z 2025-05-07T20:33:04.5216673Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.5216944Z 2025-05-07T20:33:04.5217179Z moe/activation_test.py:117: 2025-05-07T20:33:04.5217655Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.5218215Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.5218669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.5219848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.5221049Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.5221963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.5223154Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.5224245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.5225166Z kernel = self.compile( 2025-05-07T20:33:04.5226104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.5227264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.5227937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.5228351Z 2025-05-07T20:33:04.5228693Z self = 2025-05-07T20:33:04.5230694Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.5233177Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b87bb670>} 2025-05-07T20:33:04.5235525Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.5237309Z context = 2025-05-07T20:33:04.5237812Z 2025-05-07T20:33:04.5238171Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.5239087Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.5239822Z module_map=module_map) 2025-05-07T20:33:04.5240356Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.5240910Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.5241334Z E ^ 2025-05-07T20:33:04.5242118Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.5242925Z 2025-05-07T20:33:04.5243651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.5244560Z 2025-05-07T20:33:04.5244734Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.5245426Z self=, 2025-05-07T20:33:04.5246106Z T=2048, 2025-05-07T20:33:04.5246410Z D=7168, 2025-05-07T20:33:04.5246717Z scale_ub=None, 2025-05-07T20:33:04.5247061Z contiguous=False, 2025-05-07T20:33:04.5247423Z compiled=True, 2025-05-07T20:33:04.5247751Z ) 2025-05-07T20:33:04.8154529Z self = 2025-05-07T20:33:04.8155488Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:04.8156100Z 2025-05-07T20:33:04.8156230Z @given( 2025-05-07T20:33:04.8156600Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.8157092Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.8157558Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.8158161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.8158692Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.8159158Z ) 2025-05-07T20:33:04.8159756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.8160522Z def test_silu_mul_quant( 2025-05-07T20:33:04.8160927Z self, 2025-05-07T20:33:04.8161232Z T: int, 2025-05-07T20:33:04.8161551Z D: int, 2025-05-07T20:33:04.8161906Z scale_ub: Optional[float], 2025-05-07T20:33:04.8162347Z contiguous: bool, 2025-05-07T20:33:04.8162745Z compiled: bool, 2025-05-07T20:33:04.8163119Z ) -> None: 2025-05-07T20:33:04.8163459Z torch.manual_seed(2025) 2025-05-07T20:33:04.8163856Z 2025-05-07T20:33:04.8164295Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.8164870Z 2025-05-07T20:33:04.8165181Z x_sign = torch.sign(x) 2025-05-07T20:33:04.8165658Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.8166175Z x = x_sign * x_clamp 2025-05-07T20:33:04.8166569Z x0 = x[:, :D] 2025-05-07T20:33:04.8166918Z x1 = x[:, D:] 2025-05-07T20:33:04.8167254Z 2025-05-07T20:33:04.8167551Z if contiguous: 2025-05-07T20:33:04.8167927Z x0 = x0.contiguous() 2025-05-07T20:33:04.8168352Z x1 = x1.contiguous() 2025-05-07T20:33:04.8168743Z 2025-05-07T20:33:04.8169054Z if scale_ub is not None: 2025-05-07T20:33:04.8169506Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.8170062Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.8170584Z ) 2025-05-07T20:33:04.8170903Z else: 2025-05-07T20:33:04.8171243Z scale_ub_tensor = None 2025-05-07T20:33:04.8171667Z 2025-05-07T20:33:04.8172043Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8172566Z op = silu_mul_quant 2025-05-07T20:33:04.8172981Z if compiled: 2025-05-07T20:33:04.8173386Z op = torch.compile(op) 2025-05-07T20:33:04.8173865Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8174323Z 2025-05-07T20:33:04.8174762Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.8175044Z 2025-05-07T20:33:04.8175222Z moe/activation_test.py:117: 2025-05-07T20:33:04.8175699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8176255Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.8176721Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8177670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:04.8178641Z return fn(*args, **kwargs) 
2025-05-07T20:33:04.8179791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.8180988Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.8181886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.8183405Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.8184546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.8185449Z kernel = self.compile( 2025-05-07T20:33:04.8186471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.8187594Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.8188340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8188735Z 2025-05-07T20:33:04.8189074Z self = 2025-05-07T20:33:04.8191124Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.8193702Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b85ef550>} 2025-05-07T20:33:04.8196087Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.8197889Z context = 2025-05-07T20:33:04.8198398Z 2025-05-07T20:33:04.8198671Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.8199564Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.8200382Z module_map=module_map) 2025-05-07T20:33:04.8200977Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.8201566Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.8202005Z E ^ 2025-05-07T20:33:04.8202792Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8203602Z 2025-05-07T20:33:04.8204329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.8205247Z 2025-05-07T20:33:04.8205413Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.8206109Z self=, 2025-05-07T20:33:04.8206788Z T=4096, 2025-05-07T20:33:04.8207089Z D=7168, 2025-05-07T20:33:04.8207404Z scale_ub=None, 2025-05-07T20:33:04.8207743Z contiguous=False, 2025-05-07T20:33:04.8208107Z compiled=True, 2025-05-07T20:33:04.8208442Z ) 2025-05-07T20:33:04.8208964Z self = 2025-05-07T20:33:04.8209805Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:04.8210443Z 2025-05-07T20:33:04.8210576Z @given( 2025-05-07T20:33:04.8210947Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.8211456Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.8211962Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.8212522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.8213064Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.8213552Z ) 2025-05-07T20:33:04.8214139Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.8214889Z def test_silu_mul_quant( 2025-05-07T20:33:04.8215284Z self, 2025-05-07T20:33:04.8215598Z T: int, 2025-05-07T20:33:04.8215914Z D: int, 2025-05-07T20:33:04.8216260Z scale_ub: Optional[float], 2025-05-07T20:33:04.8216701Z contiguous: bool, 2025-05-07T20:33:04.8217094Z compiled: bool, 2025-05-07T20:33:04.8217447Z ) -> None: 2025-05-07T20:33:04.8217799Z torch.manual_seed(2025) 2025-05-07T20:33:04.8218198Z 2025-05-07T20:33:04.8218652Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.8219236Z 2025-05-07T20:33:04.8219545Z x_sign = torch.sign(x) 2025-05-07T20:33:04.8220087Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.8220615Z x = x_sign * x_clamp 2025-05-07T20:33:04.8221065Z x0 = x[:, :D] 2025-05-07T20:33:04.8230620Z x1 = x[:, D:] 2025-05-07T20:33:04.8230950Z 2025-05-07T20:33:04.8231227Z if contiguous: 2025-05-07T20:33:04.8231563Z x0 = x0.contiguous() 2025-05-07T20:33:04.8231955Z x1 = x1.contiguous() 2025-05-07T20:33:04.8232316Z 2025-05-07T20:33:04.8232754Z if scale_ub is not None: 2025-05-07T20:33:04.8233159Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.8233655Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.8234096Z ) 2025-05-07T20:33:04.8234385Z else: 2025-05-07T20:33:04.8234693Z scale_ub_tensor = None 2025-05-07T20:33:04.8235067Z 2025-05-07T20:33:04.8235405Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8235876Z op = silu_mul_quant 2025-05-07T20:33:04.8236237Z if compiled: 2025-05-07T20:33:04.8236613Z op = torch.compile(op) 2025-05-07T20:33:04.8237071Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8237463Z 2025-05-07T20:33:04.8237764Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.8238009Z 2025-05-07T20:33:04.8238147Z moe/activation_test.py:117: 2025-05-07T20:33:04.8238579Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8239116Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.8239566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8240503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:04.8241432Z return fn(*args, **kwargs) 
2025-05-07T20:33:04.8242544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.8243719Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.8244616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.8245758Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.8246873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.8247763Z kernel = self.compile( 2025-05-07T20:33:04.8248667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.8249762Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.8250508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8250889Z 2025-05-07T20:33:04.8251220Z self = 2025-05-07T20:33:04.8253070Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.8255465Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8367160>} 2025-05-07T20:33:04.8257779Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.8259533Z context = 2025-05-07T20:33:04.8260010Z 2025-05-07T20:33:04.8260271Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.8261137Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.8261963Z module_map=module_map) 2025-05-07T20:33:04.8262537Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.8263148Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.8263561Z E ^ 2025-05-07T20:33:04.8264324Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8265089Z 2025-05-07T20:33:04.8265786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.8266714Z 2025-05-07T20:33:05.0312302Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.0313122Z self=, 2025-05-07T20:33:05.0313822Z T=16384, 2025-05-07T20:33:05.0314137Z D=5120, 2025-05-07T20:33:05.0314436Z scale_ub=1200.0, 2025-05-07T20:33:05.0314801Z contiguous=False, 2025-05-07T20:33:05.0315156Z compiled=False, 2025-05-07T20:33:05.0315469Z ) 2025-05-07T20:33:05.0315955Z self = 2025-05-07T20:33:05.0316792Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.0317274Z 2025-05-07T20:33:05.0317406Z @given( 2025-05-07T20:33:05.0317768Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.0318290Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.0318810Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.0319350Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.0319898Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.0320377Z ) 2025-05-07T20:33:05.0320950Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.0321709Z def test_silu_mul_quant( 2025-05-07T20:33:05.0322103Z self, 2025-05-07T20:33:05.0322410Z T: int, 2025-05-07T20:33:05.0322769Z D: int, 2025-05-07T20:33:05.0323117Z scale_ub: Optional[float], 2025-05-07T20:33:05.0323559Z contiguous: bool, 2025-05-07T20:33:05.0323954Z compiled: bool, 2025-05-07T20:33:05.0324313Z ) -> None: 2025-05-07T20:33:05.0324657Z torch.manual_seed(2025) 2025-05-07T20:33:05.0325051Z 2025-05-07T20:33:05.0325489Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.0326051Z 2025-05-07T20:33:05.0326364Z x_sign = torch.sign(x) 2025-05-07T20:33:05.0326835Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.0327339Z x = x_sign * x_clamp 2025-05-07T20:33:05.0327731Z x0 = x[:, :D] 2025-05-07T20:33:05.0328364Z x1 = x[:, D:] 2025-05-07T20:33:05.0328699Z 2025-05-07T20:33:05.0328996Z if contiguous: 2025-05-07T20:33:05.0329369Z x0 = x0.contiguous() 2025-05-07T20:33:05.0329780Z x1 = x1.contiguous() 2025-05-07T20:33:05.0330171Z 2025-05-07T20:33:05.0330475Z if scale_ub is not None: 2025-05-07T20:33:05.0330912Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.0331468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.0331983Z ) 2025-05-07T20:33:05.0332281Z else: 2025-05-07T20:33:05.0332615Z scale_ub_tensor = None 2025-05-07T20:33:05.0333025Z 2025-05-07T20:33:05.0333385Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.0333925Z op = silu_mul_quant 2025-05-07T20:33:05.0334336Z if compiled: 2025-05-07T20:33:05.0334731Z op = torch.compile(op) 2025-05-07T20:33:05.0335214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0335667Z 2025-05-07T20:33:05.0335969Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.0336237Z 2025-05-07T20:33:05.0336392Z moe/activation_test.py:117: 2025-05-07T20:33:05.0336877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0337568Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.0338024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0339319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:05.0340495Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.0341387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.0342719Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.0343849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.0344739Z kernel = self.compile( 2025-05-07T20:33:05.0345642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.0346771Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.0347449Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0347846Z 2025-05-07T20:33:05.0348192Z self = 2025-05-07T20:33:05.0350232Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.0352717Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8367940>} 2025-05-07T20:33:05.0355094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.0356894Z context = 2025-05-07T20:33:05.0357381Z 2025-05-07T20:33:05.0357673Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.0358565Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.0359374Z module_map=module_map) 2025-05-07T20:33:05.0359975Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.0360549Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.0360984Z E ^ 2025-05-07T20:33:05.0361869Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.0362677Z 2025-05-07T20:33:05.0363416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.0364322Z 2025-05-07T20:33:05.0364488Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.0365185Z self=, 2025-05-07T20:33:05.0365872Z T=16384, 2025-05-07T20:33:05.0366173Z D=5120, 2025-05-07T20:33:05.0366489Z scale_ub=1200.0, 2025-05-07T20:33:05.0366847Z contiguous=True, 2025-05-07T20:33:05.0367198Z compiled=True, 2025-05-07T20:33:05.0367530Z ) 2025-05-07T20:33:05.0368060Z self = 2025-05-07T20:33:05.0368915Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.0369391Z 2025-05-07T20:33:05.0369516Z @given( 2025-05-07T20:33:05.0369889Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.0370408Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.0370906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.0371456Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.0372014Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.0372600Z ) 2025-05-07T20:33:05.0373195Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.0374006Z def test_silu_mul_quant( 2025-05-07T20:33:05.0374398Z self, 2025-05-07T20:33:05.0374705Z T: int, 2025-05-07T20:33:05.0375020Z D: int, 2025-05-07T20:33:05.0375371Z scale_ub: Optional[float], 2025-05-07T20:33:05.0375810Z contiguous: bool, 2025-05-07T20:33:05.0376263Z compiled: bool, 2025-05-07T20:33:05.0376627Z ) -> None: 2025-05-07T20:33:05.0376967Z torch.manual_seed(2025) 2025-05-07T20:33:05.0377364Z 2025-05-07T20:33:05.0377808Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.0378381Z 2025-05-07T20:33:05.0378688Z x_sign = torch.sign(x) 2025-05-07T20:33:05.0379162Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.0379666Z x = x_sign * x_clamp 2025-05-07T20:33:05.0380053Z x0 = x[:, :D] 2025-05-07T20:33:05.0380406Z x1 = x[:, D:] 2025-05-07T20:33:05.0380731Z 2025-05-07T20:33:05.0381032Z if contiguous: 2025-05-07T20:33:05.0381405Z x0 = x0.contiguous() 2025-05-07T20:33:05.0381820Z x1 = x1.contiguous() 2025-05-07T20:33:05.0382217Z 2025-05-07T20:33:05.0382526Z if scale_ub is not None: 2025-05-07T20:33:05.0383186Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.0383710Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.0384118Z ) 2025-05-07T20:33:05.0384374Z else: 2025-05-07T20:33:05.0384646Z scale_ub_tensor = None 2025-05-07T20:33:05.0385008Z 2025-05-07T20:33:05.0385311Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.0385753Z op = silu_mul_quant 2025-05-07T20:33:05.0386094Z if compiled: 2025-05-07T20:33:05.0386425Z op = torch.compile(op) 2025-05-07T20:33:05.0386830Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0387211Z 2025-05-07T20:33:05.0387465Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.0387691Z 2025-05-07T20:33:05.0387824Z moe/activation_test.py:117: 2025-05-07T20:33:05.0388238Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0388737Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.0389153Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0390146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.0391010Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.0392142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.0393213Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.0394011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.0395035Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.0396098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.0396973Z kernel = self.compile( 2025-05-07T20:33:05.0397861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.0398949Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.0399625Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0400009Z 2025-05-07T20:33:05.0400359Z self = 2025-05-07T20:33:05.0402427Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.0404811Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b81d7550>} 2025-05-07T20:33:05.0407280Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.0409201Z context = 2025-05-07T20:33:05.0409701Z 2025-05-07T20:33:05.0409990Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.0410880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.0411684Z module_map=module_map) 2025-05-07T20:33:05.0412289Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.0412881Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.0413301Z E ^ 2025-05-07T20:33:05.0414103Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.0414902Z 2025-05-07T20:33:05.0415639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.0416546Z 2025-05-07T20:33:05.2649509Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2650290Z self=, 2025-05-07T20:33:05.2650959Z T=16384, 2025-05-07T20:33:05.2651306Z D=5120, 2025-05-07T20:33:05.2651617Z scale_ub=None, 2025-05-07T20:33:05.2651968Z contiguous=False, 2025-05-07T20:33:05.2652368Z compiled=True, 2025-05-07T20:33:05.2652702Z ) 2025-05-07T20:33:05.2653204Z self = 2025-05-07T20:33:05.2654078Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:05.2654502Z 2025-05-07T20:33:05.2654619Z @given( 2025-05-07T20:33:05.2654939Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2655384Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2655825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2656318Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2656825Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2657271Z ) 2025-05-07T20:33:05.2657827Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2658814Z def test_silu_mul_quant( 2025-05-07T20:33:05.2659194Z self, 2025-05-07T20:33:05.2659486Z T: int, 2025-05-07T20:33:05.2659771Z D: int, 2025-05-07T20:33:05.2660111Z scale_ub: Optional[float], 2025-05-07T20:33:05.2660555Z contiguous: bool, 2025-05-07T20:33:05.2660927Z compiled: bool, 2025-05-07T20:33:05.2661284Z ) -> None: 2025-05-07T20:33:05.2661628Z torch.manual_seed(2025) 2025-05-07T20:33:05.2662034Z 2025-05-07T20:33:05.2662487Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2663042Z 2025-05-07T20:33:05.2663342Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2663809Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2664345Z x = x_sign * x_clamp 2025-05-07T20:33:05.2664741Z x0 = x[:, :D] 2025-05-07T20:33:05.2665085Z x1 = x[:, D:] 2025-05-07T20:33:05.2665417Z 2025-05-07T20:33:05.2665716Z if contiguous: 2025-05-07T20:33:05.2666092Z x0 = x0.contiguous() 2025-05-07T20:33:05.2666521Z x1 = x1.contiguous() 2025-05-07T20:33:05.2666917Z 2025-05-07T20:33:05.2667223Z if scale_ub is not None: 2025-05-07T20:33:05.2667675Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2668375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2668892Z ) 2025-05-07T20:33:05.2669303Z else: 2025-05-07T20:33:05.2669641Z scale_ub_tensor = None 2025-05-07T20:33:05.2670225Z 2025-05-07T20:33:05.2670601Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2671149Z op = silu_mul_quant 2025-05-07T20:33:05.2671573Z if compiled: 2025-05-07T20:33:05.2671979Z op = torch.compile(op) 2025-05-07T20:33:05.2672647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2673117Z 2025-05-07T20:33:05.2673419Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.2673697Z 2025-05-07T20:33:05.2673862Z moe/activation_test.py:117: 2025-05-07T20:33:05.2674353Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2674904Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.2675374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2676340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.2677324Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.2678454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.2679664Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.2680613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2681803Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2683308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2684237Z kernel = self.compile( 2025-05-07T20:33:05.2685169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2686294Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2686953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2687341Z 2025-05-07T20:33:05.2687670Z self = 2025-05-07T20:33:05.2689504Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2692068Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b83471f0>} 2025-05-07T20:33:05.2694420Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2696228Z context = 2025-05-07T20:33:05.2696726Z 2025-05-07T20:33:05.2697013Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2697897Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2698678Z module_map=module_map) 2025-05-07T20:33:05.2699273Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2699853Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.2700272Z E ^ 2025-05-07T20:33:05.2701068Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2701875Z 2025-05-07T20:33:05.2702669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2703577Z 2025-05-07T20:33:05.2703893Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2704590Z self=, 2025-05-07T20:33:05.2705335Z T=2048, 2025-05-07T20:33:05.2705635Z D=5120, 2025-05-07T20:33:05.2705929Z scale_ub=None, 2025-05-07T20:33:05.2706276Z contiguous=False, 2025-05-07T20:33:05.2706636Z compiled=True, 2025-05-07T20:33:05.2706954Z ) 2025-05-07T20:33:05.3911519Z self = 2025-05-07T20:33:05.3912796Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:05.3913292Z 2025-05-07T20:33:05.3913419Z @given( 2025-05-07T20:33:05.3913796Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3914289Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3914760Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3915265Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3915815Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3916284Z ) 2025-05-07T20:33:05.3916885Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3917646Z def test_silu_mul_quant( 2025-05-07T20:33:05.3918038Z self, 2025-05-07T20:33:05.3918354Z T: int, 2025-05-07T20:33:05.3918671Z D: int, 2025-05-07T20:33:05.3919014Z scale_ub: Optional[float], 2025-05-07T20:33:05.3919466Z contiguous: bool, 2025-05-07T20:33:05.3919861Z compiled: bool, 2025-05-07T20:33:05.3920218Z ) -> None: 2025-05-07T20:33:05.3920571Z torch.manual_seed(2025) 2025-05-07T20:33:05.3920963Z 2025-05-07T20:33:05.3921399Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3921976Z 2025-05-07T20:33:05.3922296Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3922762Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3923280Z x = x_sign * x_clamp 2025-05-07T20:33:05.3923676Z x0 = x[:, :D] 2025-05-07T20:33:05.3924029Z x1 = x[:, D:] 2025-05-07T20:33:05.3924359Z 2025-05-07T20:33:05.3924656Z if contiguous: 2025-05-07T20:33:05.3925028Z x0 = x0.contiguous() 2025-05-07T20:33:05.3925450Z x1 = x1.contiguous() 2025-05-07T20:33:05.3925849Z 2025-05-07T20:33:05.3926154Z if scale_ub is not None: 2025-05-07T20:33:05.3926589Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3927145Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3927657Z ) 2025-05-07T20:33:05.3927956Z else: 2025-05-07T20:33:05.3928430Z scale_ub_tensor = None 2025-05-07T20:33:05.3928850Z 2025-05-07T20:33:05.3929212Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3929739Z op = silu_mul_quant 2025-05-07T20:33:05.3930150Z if compiled: 2025-05-07T20:33:05.3930542Z op = torch.compile(op) 2025-05-07T20:33:05.3931033Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3931494Z 2025-05-07T20:33:05.3931802Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3932077Z 2025-05-07T20:33:05.3932239Z moe/activation_test.py:117: 2025-05-07T20:33:05.3932730Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3933293Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3933751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3934710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3935681Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3936804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3938002Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3939015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3940177Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3941392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3942298Z kernel = self.compile( 2025-05-07T20:33:05.3943195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3944389Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3945053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3945459Z 2025-05-07T20:33:05.3945797Z self = 2025-05-07T20:33:05.3947705Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3950367Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8347f70>} 2025-05-07T20:33:05.3952742Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3964816Z context = 2025-05-07T20:33:05.3965341Z 2025-05-07T20:33:05.3965642Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3966539Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3967350Z module_map=module_map) 2025-05-07T20:33:05.3967970Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3968549Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3968988Z E ^ 2025-05-07T20:33:05.3969792Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3970593Z 2025-05-07T20:33:05.3971337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3972256Z 2025-05-07T20:33:05.3972424Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3973125Z self=, 2025-05-07T20:33:05.3973908Z T=2048, 2025-05-07T20:33:05.3974213Z D=5120, 2025-05-07T20:33:05.3974525Z scale_ub=1200.0, 2025-05-07T20:33:05.3974887Z contiguous=False, 2025-05-07T20:33:05.3975250Z compiled=True, 2025-05-07T20:33:05.3975584Z ) 2025-05-07T20:33:05.3976119Z self = 2025-05-07T20:33:05.3976955Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.3977439Z 2025-05-07T20:33:05.3977564Z @given( 2025-05-07T20:33:05.3977936Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3978458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3978959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3979510Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3980065Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3980512Z ) 2025-05-07T20:33:05.3981037Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3981669Z def test_silu_mul_quant( 2025-05-07T20:33:05.3981999Z self, 2025-05-07T20:33:05.3982265Z T: int, 2025-05-07T20:33:05.3982542Z D: int, 2025-05-07T20:33:05.3983198Z scale_ub: Optional[float], 2025-05-07T20:33:05.3983581Z contiguous: bool, 2025-05-07T20:33:05.3984063Z compiled: bool, 2025-05-07T20:33:05.3984448Z ) -> None: 2025-05-07T20:33:05.3984727Z torch.manual_seed(2025) 2025-05-07T20:33:05.3985052Z 2025-05-07T20:33:05.3985440Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3985982Z 2025-05-07T20:33:05.3986259Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3986685Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3987298Z x = x_sign * x_clamp 2025-05-07T20:33:05.3987642Z x0 = x[:, :D] 2025-05-07T20:33:05.3987956Z x1 = x[:, D:] 2025-05-07T20:33:05.3988259Z 2025-05-07T20:33:05.3988528Z if contiguous: 2025-05-07T20:33:05.3988851Z x0 = x0.contiguous() 2025-05-07T20:33:05.3989213Z x1 = x1.contiguous() 2025-05-07T20:33:05.3989543Z 2025-05-07T20:33:05.3989924Z if scale_ub is not None: 2025-05-07T20:33:05.3990337Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3990830Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3991290Z ) 2025-05-07T20:33:05.3991565Z else: 2025-05-07T20:33:05.3991861Z scale_ub_tensor = None 2025-05-07T20:33:05.3992236Z 2025-05-07T20:33:05.3992576Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3993061Z op = silu_mul_quant 2025-05-07T20:33:05.3993462Z if compiled: 2025-05-07T20:33:05.3993853Z op = torch.compile(op) 2025-05-07T20:33:05.3994275Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3994711Z 2025-05-07T20:33:05.3995010Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3995270Z 2025-05-07T20:33:05.3995432Z moe/activation_test.py:117: 2025-05-07T20:33:05.3995892Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3996430Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3996877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3997790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3998729Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3999838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.4001013Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.4001895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.4003177Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.4004292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.4005182Z kernel = self.compile( 2025-05-07T20:33:05.4006073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.4007124Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.4007766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.4008142Z 2025-05-07T20:33:05.4008471Z self = 2025-05-07T20:33:05.4010278Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.4012726Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8195940>} 2025-05-07T20:33:05.4015114Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.4016831Z context = 2025-05-07T20:33:05.4017401Z 2025-05-07T20:33:05.4017672Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.4018573Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.4019386Z module_map=module_map) 2025-05-07T20:33:05.4020052Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.4020642Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.4021072Z E ^ 2025-05-07T20:33:05.4021878Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.4022681Z 2025-05-07T20:33:05.4023412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.4024333Z 2025-05-07T20:33:05.8001111Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.8001903Z self=, 2025-05-07T20:33:05.8002645Z T=4096, 2025-05-07T20:33:05.8002944Z D=5120, 2025-05-07T20:33:05.8003250Z scale_ub=1200.0, 2025-05-07T20:33:05.8003618Z contiguous=True, 2025-05-07T20:33:05.8003960Z compiled=True, 2025-05-07T20:33:05.8004290Z ) 2025-05-07T20:33:05.8004819Z self = 2025-05-07T20:33:05.8005645Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.8006073Z 2025-05-07T20:33:05.8006189Z @given( 2025-05-07T20:33:05.8006503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.8006944Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.8007372Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.8007855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.8008346Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.8008779Z ) 2025-05-07T20:33:05.8009332Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.8010043Z def test_silu_mul_quant( 2025-05-07T20:33:05.8010383Z self, 2025-05-07T20:33:05.8010664Z T: int, 2025-05-07T20:33:05.8010957Z D: int, 2025-05-07T20:33:05.8011270Z scale_ub: Optional[float], 2025-05-07T20:33:05.8011681Z contiguous: bool, 2025-05-07T20:33:05.8012064Z compiled: bool, 2025-05-07T20:33:05.8012403Z ) -> None: 2025-05-07T20:33:05.8013065Z torch.manual_seed(2025) 2025-05-07T20:33:05.8013471Z 2025-05-07T20:33:05.8013931Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.8014498Z 2025-05-07T20:33:05.8014790Z x_sign = torch.sign(x) 2025-05-07T20:33:05.8015250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.8015764Z x = x_sign * x_clamp 2025-05-07T20:33:05.8016157Z x0 = x[:, :D] 2025-05-07T20:33:05.8016508Z x1 = x[:, D:] 2025-05-07T20:33:05.8016833Z 2025-05-07T20:33:05.8017131Z if contiguous: 2025-05-07T20:33:05.8017503Z x0 = x0.contiguous() 2025-05-07T20:33:05.8017923Z x1 = x1.contiguous() 2025-05-07T20:33:05.8018320Z 2025-05-07T20:33:05.8018634Z if scale_ub is not None: 2025-05-07T20:33:05.8019078Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.8019637Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.8020164Z ) 2025-05-07T20:33:05.8020465Z else: 2025-05-07T20:33:05.8020798Z scale_ub_tensor = None 2025-05-07T20:33:05.8021203Z 2025-05-07T20:33:05.8021579Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.8022091Z op = silu_mul_quant 2025-05-07T20:33:05.8022497Z if compiled: 2025-05-07T20:33:05.8023046Z op = torch.compile(op) 2025-05-07T20:33:05.8023626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.8024084Z 2025-05-07T20:33:05.8024390Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.8024663Z 2025-05-07T20:33:05.8024820Z moe/activation_test.py:117: 2025-05-07T20:33:05.8025308Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.8025998Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.8026451Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.8027410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.8028376Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.8029513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.8030926Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.8031857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.8033102Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.8034250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.8035169Z kernel = self.compile( 2025-05-07T20:33:05.8036083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.8037244Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.8037905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.8038294Z 2025-05-07T20:33:05.8038630Z self = 2025-05-07T20:33:05.8040445Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.8042881Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b80f7790>} 2025-05-07T20:33:05.8045226Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.8047010Z context = 2025-05-07T20:33:05.8047600Z 2025-05-07T20:33:05.8047878Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.8048774Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.8049579Z module_map=module_map) 2025-05-07T20:33:05.8050170Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.8050742Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.8051155Z E ^ 2025-05-07T20:33:05.8051935Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.8052763Z 2025-05-07T20:33:05.8053489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.8054400Z 2025-05-07T20:33:05.8054574Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.8055272Z self=, 2025-05-07T20:33:05.8055945Z T=128, 2025-05-07T20:33:05.8056244Z D=5120, 2025-05-07T20:33:05.8056554Z scale_ub=1200.0, 2025-05-07T20:33:05.8056897Z contiguous=False, 2025-05-07T20:33:05.8057252Z compiled=True, 2025-05-07T20:33:05.8057577Z ) 2025-05-07T20:33:05.9381438Z self = 2025-05-07T20:33:05.9383140Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.9383467Z 2025-05-07T20:33:05.9383548Z @given( 2025-05-07T20:33:05.9383784Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.9384107Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.9384423Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.9384871Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.9385206Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.9385503Z ) 2025-05-07T20:33:05.9385871Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.9386340Z def test_silu_mul_quant( 2025-05-07T20:33:05.9386576Z self, 2025-05-07T20:33:05.9386771Z T: int, 2025-05-07T20:33:05.9386969Z D: int, 2025-05-07T20:33:05.9387183Z scale_ub: Optional[float], 2025-05-07T20:33:05.9387457Z contiguous: bool, 2025-05-07T20:33:05.9387702Z compiled: bool, 2025-05-07T20:33:05.9387925Z ) -> None: 2025-05-07T20:33:05.9388142Z torch.manual_seed(2025) 2025-05-07T20:33:05.9388386Z 2025-05-07T20:33:05.9388656Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.9389020Z 2025-05-07T20:33:05.9389213Z x_sign = torch.sign(x) 2025-05-07T20:33:05.9389507Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.9389939Z x = x_sign * x_clamp 2025-05-07T20:33:05.9390190Z x0 = x[:, :D] 2025-05-07T20:33:05.9390407Z x1 = x[:, D:] 2025-05-07T20:33:05.9390615Z 2025-05-07T20:33:05.9390807Z if contiguous: 2025-05-07T20:33:05.9391034Z x0 = x0.contiguous() 2025-05-07T20:33:05.9391300Z x1 = x1.contiguous() 2025-05-07T20:33:05.9391547Z 2025-05-07T20:33:05.9391743Z if scale_ub is not None: 2025-05-07T20:33:05.9392017Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.9392363Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.9392681Z ) 2025-05-07T20:33:05.9392865Z else: 2025-05-07T20:33:05.9393074Z scale_ub_tensor = None 2025-05-07T20:33:05.9393331Z 2025-05-07T20:33:05.9393553Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.9393878Z op = silu_mul_quant 2025-05-07T20:33:05.9394132Z if compiled: 2025-05-07T20:33:05.9394372Z op = torch.compile(op) 2025-05-07T20:33:05.9394672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.9395033Z 2025-05-07T20:33:05.9395222Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.9395395Z 2025-05-07T20:33:05.9395493Z moe/activation_test.py:117: 2025-05-07T20:33:05.9395795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.9396140Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.9396424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.9397020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.9397617Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.9398315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.9399058Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.9399620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.9400353Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.9401056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.9401624Z kernel = self.compile( 2025-05-07T20:33:05.9402288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.9403060Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.9403473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.9403719Z 2025-05-07T20:33:05.9403930Z self = 2025-05-07T20:33:05.9405102Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.9406737Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b80690d0>} 2025-05-07T20:33:05.9408207Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.9409318Z context = 2025-05-07T20:33:05.9409622Z 2025-05-07T20:33:05.9409800Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.9410348Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.9410838Z module_map=module_map) 2025-05-07T20:33:05.9411217Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.9411584Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.9411846Z E ^ 2025-05-07T20:33:05.9412339Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:05.9413284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:05.9413947Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> fails with the identical CompilationError and traceback (ValueError: type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5'))
2025-05-07T20:33:06.2195436Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> identical CompilationError
2025-05-07T20:33:06.2237650Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> identical CompilationError
2025-05-07T20:33:06.2269740Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> identical CompilationError
2025-05-07T20:33:06.3470700Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> identical CompilationError
2025-05-07T20:33:06.5233768Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> identical CompilationError
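[Note: the error message itself pins down the cause. Triton only lowers the fp8e4nv dtype (FP8 E4M3) on NVIDIA GPUs of compute capability 8.9 and newer (Ada/Hopper); on older architectures only fp8e4b15 and fp8e5 are available, which is exactly the pair listed in the ValueError above. Below is a minimal sketch of one way to gate such a test on hardware support; the helper, class name, and skip message are illustrative assumptions, not part of FBGEMM's test suite.]

    import unittest

    import torch

    def device_supports_fp8e4nv() -> bool:
        # fp8e4nv (FP8 E4M3) lowering in Triton requires an sm_89+ NVIDIA GPU;
        # torch.cuda.get_device_capability() returns (major, minor).
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )

    @unittest.skipIf(
        not device_supports_fp8e4nv(),
        "fp8e4nv (FP8 E4M3) requires compute capability >= 8.9",
    )
    class ActivationFP8Test(unittest.TestCase):  # hypothetical class name
        ...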
2025-05-07T20:33:06.5277463Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): tried to allocate 320.00 MiB; GPU 0 has 22.07 GiB total, 140.44 MiB free, 21.92 GiB in use. The allocator suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True if reserved-but-unallocated memory is large (see https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:06.5291979Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95: tried to allocate 112.00 MiB; 28.44 MiB free
2025-05-07T20:33:06.5305935Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)): tried to allocate 448.00 MiB; 140.44 MiB free
2025-05-07T20:33:06.6363547Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95: tried to allocate 56.00 MiB; 28.44 MiB free
2025-05-07T20:33:06.6378524Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB; 28.44 MiB free
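[Note: the OutOfMemoryError failures interleaved here are consistent with many Hypothesis examples running in one process while device memory stays nearly full (the log shows ~21.9-22.0 GiB of 22.07 GiB in use). Each example allocates a [T, 2*D] bfloat16 input plus same-sized temporaries; for T=16384, D=7168 that is 16384 * 14336 * 2 bytes = 448 MiB per tensor, matching the failed allocation sizes above. A sketch of one possible mitigation follows; the helper name and tearDown placement are assumptions, and the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True setting quoted by the allocator addresses fragmentation rather than total usage.]

    import gc

    import torch

    def free_cuda_memory() -> None:
        # Drop dead Python references first, then release the caching
        # allocator's unused blocks so the next example starts clean.
        gc.collect()
        torch.cuda.empty_cache()

    # e.g. in the test class (hypothetical placement):
    #     def tearDown(self) -> None:
    #         free_cuda_memory()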
2025-05-07T20:33:06.6392270Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> identical CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:33:06.9743615Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> identical CompilationError
2025-05-07T20:33:06.9775735Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> identical CompilationError
2025-05-07T20:33:07.0712802Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)): tried to allocate 56.00 MiB; GPU 0 has 26.44 MiB free, 22.04 GiB in use
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.0725238Z 2025-05-07T20:33:07.0725358Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.0725587Z 2025-05-07T20:33:07.0725686Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.0726113Z self=, 2025-05-07T20:33:07.0726524Z T=1, 2025-05-07T20:33:07.0726706Z D=5120, 2025-05-07T20:33:07.0726894Z scale_ub=1200.0, 2025-05-07T20:33:07.0727156Z contiguous=True, 2025-05-07T20:33:07.0727382Z compiled=False, 2025-05-07T20:33:07.0727627Z ) 2025-05-07T20:33:07.1203388Z self = 2025-05-07T20:33:07.1204144Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.1204528Z 2025-05-07T20:33:07.1204644Z @given( 2025-05-07T20:33:07.1204873Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.1205412Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.1205736Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.1206083Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.1206427Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.1206728Z ) 2025-05-07T20:33:07.1207085Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.1207557Z def test_silu_mul_quant( 2025-05-07T20:33:07.1207803Z self, 2025-05-07T20:33:07.1208000Z T: int, 2025-05-07T20:33:07.1208199Z D: int, 2025-05-07T20:33:07.1208428Z scale_ub: Optional[float], 2025-05-07T20:33:07.1208701Z contiguous: bool, 2025-05-07T20:33:07.1208945Z compiled: bool, 2025-05-07T20:33:07.1209174Z ) -> None: 2025-05-07T20:33:07.1209391Z torch.manual_seed(2025) 2025-05-07T20:33:07.1209631Z 2025-05-07T20:33:07.1209907Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.1210270Z 2025-05-07T20:33:07.1210463Z x_sign = torch.sign(x) 2025-05-07T20:33:07.1210762Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.1211090Z x = x_sign * x_clamp 2025-05-07T20:33:07.1211332Z x0 = x[:, :D] 2025-05-07T20:33:07.1211551Z x1 = x[:, D:] 2025-05-07T20:33:07.1211763Z 2025-05-07T20:33:07.1211947Z if contiguous: 2025-05-07T20:33:07.1212180Z x0 = x0.contiguous() 2025-05-07T20:33:07.1212470Z x1 = x1.contiguous() 2025-05-07T20:33:07.1212744Z 2025-05-07T20:33:07.1212943Z if scale_ub is not None: 2025-05-07T20:33:07.1213225Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.1213567Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.1213898Z ) 2025-05-07T20:33:07.1214094Z else: 2025-05-07T20:33:07.1214307Z scale_ub_tensor = None 2025-05-07T20:33:07.1214569Z 2025-05-07T20:33:07.1214800Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.1215125Z op = silu_mul_quant 2025-05-07T20:33:07.1215374Z if compiled: 2025-05-07T20:33:07.1215726Z op = torch.compile(op) 2025-05-07T20:33:07.1216032Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.1216311Z 2025-05-07T20:33:07.1216505Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.1216674Z 2025-05-07T20:33:07.1216780Z moe/activation_test.py:117: 2025-05-07T20:33:07.1217081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.1217433Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.1217726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.1218474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.1219214Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.1219784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.1220518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.1221221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.1221793Z kernel = self.compile( 2025-05-07T20:33:07.1222367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.1223184Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.1223653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.1223902Z 2025-05-07T20:33:07.1224113Z self = 2025-05-07T20:33:07.1225281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.1226867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b7ca3040>} 2025-05-07T20:33:07.1228401Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.1229981Z context = 2025-05-07T20:33:07.1230298Z 2025-05-07T20:33:07.1230467Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.1231021Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.1231506Z module_map=module_map) 2025-05-07T20:33:07.1231888Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.1232247Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.1232513Z E ^ 2025-05-07T20:33:07.1233006Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.1233499Z 2025-05-07T20:33:07.1233947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.1234502Z 2025-05-07T20:33:07.1234613Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.1235039Z self=, 2025-05-07T20:33:07.1235461Z T=2048, 2025-05-07T20:33:07.1235648Z D=5120, 2025-05-07T20:33:07.1235838Z scale_ub=None, 2025-05-07T20:33:07.1236048Z contiguous=True, 2025-05-07T20:33:07.1236279Z compiled=False, 2025-05-07T20:33:07.1236491Z ) 2025-05-07T20:33:07.1236817Z self = 2025-05-07T20:33:07.1237341Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.1237629Z 2025-05-07T20:33:07.1237781Z @given( 2025-05-07T20:33:07.1238006Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.1238327Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.1238644Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.1238985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.1239329Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.1239629Z ) 2025-05-07T20:33:07.1239986Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.1240446Z def test_silu_mul_quant( 2025-05-07T20:33:07.1240696Z self, 2025-05-07T20:33:07.1240881Z T: int, 2025-05-07T20:33:07.1241077Z D: int, 2025-05-07T20:33:07.1241296Z scale_ub: Optional[float], 2025-05-07T20:33:07.1241563Z contiguous: bool, 2025-05-07T20:33:07.1241803Z compiled: bool, 2025-05-07T20:33:07.1242027Z ) -> None: 2025-05-07T20:33:07.1242240Z torch.manual_seed(2025) 2025-05-07T20:33:07.1242512Z 2025-05-07T20:33:07.1242813Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.1243162Z 2025-05-07T20:33:07.1243353Z > x_sign = torch.sign(x) 2025-05-07T20:33:07.1245532Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.1247669Z 2025-05-07T20:33:07.1247785Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:07.1248003Z 2025-05-07T20:33:07.1248114Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.1248535Z self=, 2025-05-07T20:33:07.1248960Z T=16384, 2025-05-07T20:33:07.1249156Z D=5120, 2025-05-07T20:33:07.1249339Z scale_ub=None, 2025-05-07T20:33:07.1249552Z contiguous=True, 2025-05-07T20:33:07.1249776Z compiled=False, 2025-05-07T20:33:07.1249968Z ) 2025-05-07T20:33:07.1250295Z self = 2025-05-07T20:33:07.1250814Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.1251105Z 2025-05-07T20:33:07.1251188Z @given( 2025-05-07T20:33:07.1251408Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.1251732Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.1252043Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.1252378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.1252718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.1253006Z ) 2025-05-07T20:33:07.1253363Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.1253828Z def test_silu_mul_quant( 2025-05-07T20:33:07.1254066Z self, 2025-05-07T20:33:07.1254255Z T: int, 2025-05-07T20:33:07.1254446Z D: int, 2025-05-07T20:33:07.1254663Z scale_ub: Optional[float], 2025-05-07T20:33:07.1254941Z contiguous: bool, 2025-05-07T20:33:07.1255175Z compiled: bool, 2025-05-07T20:33:07.1255397Z ) -> None: 2025-05-07T20:33:07.1255608Z torch.manual_seed(2025) 2025-05-07T20:33:07.1255849Z 2025-05-07T20:33:07.1256121Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.1258417Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.1260476Z 2025-05-07T20:33:07.1260603Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.1260820Z 2025-05-07T20:33:07.1260931Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.1261348Z self=, 2025-05-07T20:33:07.1261769Z T=4096, 2025-05-07T20:33:07.1261950Z D=5120, 2025-05-07T20:33:07.1262136Z scale_ub=None, 2025-05-07T20:33:07.1262351Z contiguous=True, 2025-05-07T20:33:07.1262575Z compiled=False, 2025-05-07T20:33:07.1262772Z ) 2025-05-07T20:33:07.2293498Z self = 2025-05-07T20:33:07.2294246Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.2294635Z 2025-05-07T20:33:07.2294745Z @given( 2025-05-07T20:33:07.2295050Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2295703Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2296110Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2296628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2297030Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2297321Z ) 2025-05-07T20:33:07.2297681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2298142Z def test_silu_mul_quant( 2025-05-07T20:33:07.2298463Z self, 2025-05-07T20:33:07.2298652Z T: int, 2025-05-07T20:33:07.2298840Z D: int, 2025-05-07T20:33:07.2299059Z scale_ub: Optional[float], 2025-05-07T20:33:07.2299333Z contiguous: bool, 2025-05-07T20:33:07.2299567Z compiled: bool, 2025-05-07T20:33:07.2299791Z ) -> None: 2025-05-07T20:33:07.2300008Z torch.manual_seed(2025) 2025-05-07T20:33:07.2300246Z 2025-05-07T20:33:07.2300518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2302757Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2304825Z 2025-05-07T20:33:07.2304942Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.2305158Z 2025-05-07T20:33:07.2305266Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2305682Z self=, 2025-05-07T20:33:07.2306101Z T=2048, 2025-05-07T20:33:07.2306284Z D=5120, 2025-05-07T20:33:07.2306467Z scale_ub=None, 2025-05-07T20:33:07.2306682Z contiguous=False, 2025-05-07T20:33:07.2306908Z compiled=False, 2025-05-07T20:33:07.2307107Z ) 2025-05-07T20:33:07.2307433Z self = 2025-05-07T20:33:07.2307950Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.2308237Z 2025-05-07T20:33:07.2308318Z @given( 2025-05-07T20:33:07.2308540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2308864Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2309178Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2309592Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2310069Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2310368Z ) 2025-05-07T20:33:07.2310723Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2311190Z def test_silu_mul_quant( 2025-05-07T20:33:07.2311443Z self, 2025-05-07T20:33:07.2311635Z T: int, 2025-05-07T20:33:07.2311831Z D: int, 2025-05-07T20:33:07.2312051Z scale_ub: Optional[float], 2025-05-07T20:33:07.2312325Z contiguous: bool, 2025-05-07T20:33:07.2312562Z compiled: bool, 2025-05-07T20:33:07.2312788Z ) -> None: 2025-05-07T20:33:07.2313003Z torch.manual_seed(2025) 2025-05-07T20:33:07.2313242Z 2025-05-07T20:33:07.2313518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2315808Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2317883Z 2025-05-07T20:33:07.2318003Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.2318220Z 2025-05-07T20:33:07.2318327Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2318745Z self=, 2025-05-07T20:33:07.2319208Z T=4096, 2025-05-07T20:33:07.2319390Z D=7168, 2025-05-07T20:33:07.2319572Z scale_ub=None, 2025-05-07T20:33:07.2319785Z contiguous=True, 2025-05-07T20:33:07.2320006Z compiled=True, 2025-05-07T20:33:07.2320202Z ) 2025-05-07T20:33:07.2320530Z self = 2025-05-07T20:33:07.2321044Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.2321324Z 2025-05-07T20:33:07.2321399Z @given( 2025-05-07T20:33:07.2321626Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2321944Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2322262Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2322596Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2322934Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2323227Z ) 2025-05-07T20:33:07.2323577Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2324041Z def test_silu_mul_quant( 2025-05-07T20:33:07.2324284Z self, 2025-05-07T20:33:07.2324470Z T: int, 2025-05-07T20:33:07.2324667Z D: int, 2025-05-07T20:33:07.2324891Z scale_ub: Optional[float], 2025-05-07T20:33:07.2325157Z contiguous: bool, 2025-05-07T20:33:07.2325398Z compiled: bool, 2025-05-07T20:33:07.2325618Z ) -> None: 2025-05-07T20:33:07.2325826Z torch.manual_seed(2025) 2025-05-07T20:33:07.2326074Z 2025-05-07T20:33:07.2326350Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2328597Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2330699Z 2025-05-07T20:33:07.2330826Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.2331048Z 2025-05-07T20:33:07.2331152Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2331586Z self=, 2025-05-07T20:33:07.2332013Z T=2048, 2025-05-07T20:33:07.2332197Z D=5120, 2025-05-07T20:33:07.2332387Z scale_ub=1200.0, 2025-05-07T20:33:07.2332613Z contiguous=False, 2025-05-07T20:33:07.2332835Z compiled=False, 2025-05-07T20:33:07.2333043Z ) 2025-05-07T20:33:07.2333371Z self = 2025-05-07T20:33:07.2333894Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.2334217Z 2025-05-07T20:33:07.2334304Z @given( 2025-05-07T20:33:07.2334525Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2334847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2335167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2335501Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2335839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2336132Z ) 2025-05-07T20:33:07.2336482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2336993Z def test_silu_mul_quant( 2025-05-07T20:33:07.2337304Z self, 2025-05-07T20:33:07.2337495Z T: int, 2025-05-07T20:33:07.2337682Z D: int, 2025-05-07T20:33:07.2337902Z scale_ub: Optional[float], 2025-05-07T20:33:07.2338173Z contiguous: bool, 2025-05-07T20:33:07.2338408Z compiled: bool, 2025-05-07T20:33:07.2338628Z ) -> None: 2025-05-07T20:33:07.2338841Z torch.manual_seed(2025) 2025-05-07T20:33:07.2339125Z 2025-05-07T20:33:07.2339398Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2341627Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2343727Z 2025-05-07T20:33:07.2343848Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.2344067Z 2025-05-07T20:33:07.2344177Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2344594Z self=, 2025-05-07T20:33:07.2345016Z T=4096, 2025-05-07T20:33:07.2345200Z D=7168, 2025-05-07T20:33:07.2345383Z scale_ub=1200.0, 2025-05-07T20:33:07.2345603Z contiguous=True, 2025-05-07T20:33:07.2345824Z compiled=False, 2025-05-07T20:33:07.2346027Z ) 2025-05-07T20:33:07.2346360Z self = 2025-05-07T20:33:07.2346876Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.2347165Z 2025-05-07T20:33:07.2347245Z @given( 2025-05-07T20:33:07.2347470Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2355169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2355538Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2355903Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2356253Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2356557Z ) 2025-05-07T20:33:07.2356932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2357413Z def test_silu_mul_quant( 2025-05-07T20:33:07.2357679Z self, 2025-05-07T20:33:07.2357963Z T: int, 2025-05-07T20:33:07.2358176Z D: int, 2025-05-07T20:33:07.2358405Z scale_ub: Optional[float], 2025-05-07T20:33:07.2358686Z contiguous: bool, 2025-05-07T20:33:07.2358939Z compiled: bool, 2025-05-07T20:33:07.2359174Z ) -> None: 2025-05-07T20:33:07.2359397Z torch.manual_seed(2025) 2025-05-07T20:33:07.2359662Z 2025-05-07T20:33:07.2359955Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2362236Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2364308Z 2025-05-07T20:33:07.2364437Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.2364663Z 2025-05-07T20:33:07.2364769Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2365252Z self=, 2025-05-07T20:33:07.2365680Z T=16384, 2025-05-07T20:33:07.2365914Z D=7168, 2025-05-07T20:33:07.2366115Z scale_ub=None, 2025-05-07T20:33:07.2366336Z contiguous=False, 2025-05-07T20:33:07.2366564Z compiled=True, 2025-05-07T20:33:07.2366772Z ) 2025-05-07T20:33:07.3653523Z self = 2025-05-07T20:33:07.3654310Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.3655033Z 2025-05-07T20:33:07.3655119Z @given( 2025-05-07T20:33:07.3655363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.3655704Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.3656032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.3656368Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.3656711Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.3657009Z ) 2025-05-07T20:33:07.3657373Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.3657851Z def test_silu_mul_quant( 2025-05-07T20:33:07.3658107Z self, 2025-05-07T20:33:07.3658314Z T: int, 2025-05-07T20:33:07.3658507Z D: int, 2025-05-07T20:33:07.3658732Z scale_ub: Optional[float], 2025-05-07T20:33:07.3659015Z contiguous: bool, 2025-05-07T20:33:07.3659254Z compiled: bool, 2025-05-07T20:33:07.3659490Z ) -> None: 2025-05-07T20:33:07.3659708Z torch.manual_seed(2025) 2025-05-07T20:33:07.3659947Z 2025-05-07T20:33:07.3660227Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.3662492Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.3664583Z 2025-05-07T20:33:07.3664700Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.3664918Z 2025-05-07T20:33:07.3665029Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.3665450Z self=, 2025-05-07T20:33:07.3665868Z T=4096, 2025-05-07T20:33:07.3666055Z D=7168, 2025-05-07T20:33:07.3666330Z scale_ub=None, 2025-05-07T20:33:07.3666548Z contiguous=True, 2025-05-07T20:33:07.3666769Z compiled=False, 2025-05-07T20:33:07.3666976Z ) 2025-05-07T20:33:07.3667302Z self = 2025-05-07T20:33:07.3667823Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.3668109Z 2025-05-07T20:33:07.3668196Z @given( 2025-05-07T20:33:07.3668418Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.3668738Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.3669048Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.3669378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.3669713Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.3670157Z ) 2025-05-07T20:33:07.3670512Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.3670980Z def test_silu_mul_quant( 2025-05-07T20:33:07.3671247Z self, 2025-05-07T20:33:07.3671446Z T: int, 2025-05-07T20:33:07.3671632Z D: int, 2025-05-07T20:33:07.3671849Z scale_ub: Optional[float], 2025-05-07T20:33:07.3672125Z contiguous: bool, 2025-05-07T20:33:07.3672370Z compiled: bool, 2025-05-07T20:33:07.3672723Z ) -> None: 2025-05-07T20:33:07.3672945Z torch.manual_seed(2025) 2025-05-07T20:33:07.3673267Z 2025-05-07T20:33:07.3673539Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.3675777Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.3677873Z 2025-05-07T20:33:07.3677992Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.3678214Z 2025-05-07T20:33:07.3678322Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.3678753Z self=, 2025-05-07T20:33:07.3679180Z T=16384, 2025-05-07T20:33:07.3679375Z D=7168, 2025-05-07T20:33:07.3679568Z scale_ub=None, 2025-05-07T20:33:07.3679773Z contiguous=True, 2025-05-07T20:33:07.3679997Z compiled=False, 2025-05-07T20:33:07.3680200Z ) 2025-05-07T20:33:07.3680523Z self = 2025-05-07T20:33:07.3681051Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.3681345Z 2025-05-07T20:33:07.3681430Z @given( 2025-05-07T20:33:07.3681652Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.3681969Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.3682289Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.3682654Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.3683288Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.3683581Z ) 2025-05-07T20:33:07.3683944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.3684410Z def test_silu_mul_quant( 2025-05-07T20:33:07.3684658Z self, 2025-05-07T20:33:07.3684868Z T: int, 2025-05-07T20:33:07.3685063Z D: int, 2025-05-07T20:33:07.3685282Z scale_ub: Optional[float], 2025-05-07T20:33:07.3685560Z contiguous: bool, 2025-05-07T20:33:07.3685796Z compiled: bool, 2025-05-07T20:33:07.3686019Z ) -> None: 2025-05-07T20:33:07.3686240Z torch.manual_seed(2025) 2025-05-07T20:33:07.3686477Z 2025-05-07T20:33:07.3686823Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.3689067Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.3691127Z 2025-05-07T20:33:07.3691243Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.3691462Z 2025-05-07T20:33:07.3691570Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.3691990Z self=, 2025-05-07T20:33:07.3692416Z T=16384, 2025-05-07T20:33:07.3692619Z D=7168, 2025-05-07T20:33:07.3692803Z scale_ub=1200.0, 2025-05-07T20:33:07.3693027Z contiguous=True, 2025-05-07T20:33:07.3693247Z compiled=False, 2025-05-07T20:33:07.3693445Z ) 2025-05-07T20:33:07.3693772Z self = 2025-05-07T20:33:07.3694361Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.3694711Z 2025-05-07T20:33:07.3694792Z @given( 2025-05-07T20:33:07.3695014Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.3695339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.3695653Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.3695988Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.3696389Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.3696687Z ) 2025-05-07T20:33:07.3697047Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.3697516Z def test_silu_mul_quant( 2025-05-07T20:33:07.3697765Z self, 2025-05-07T20:33:07.3697953Z T: int, 2025-05-07T20:33:07.3698152Z D: int, 2025-05-07T20:33:07.3698377Z scale_ub: Optional[float], 2025-05-07T20:33:07.3698656Z contiguous: bool, 2025-05-07T20:33:07.3698892Z compiled: bool, 2025-05-07T20:33:07.3699121Z ) -> None: 2025-05-07T20:33:07.3699335Z torch.manual_seed(2025) 2025-05-07T20:33:07.3699572Z 2025-05-07T20:33:07.3699843Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.3702088Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.3704199Z 2025-05-07T20:33:07.3704323Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.3704544Z 2025-05-07T20:33:07.3704654Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.3705075Z self=, 2025-05-07T20:33:07.3705493Z T=128, 2025-05-07T20:33:07.3705680Z D=5120, 2025-05-07T20:33:07.3705866Z scale_ub=1200.0, 2025-05-07T20:33:07.3706089Z contiguous=False, 2025-05-07T20:33:07.3706314Z compiled=False, 2025-05-07T20:33:07.3706518Z ) 2025-05-07T20:33:07.5328610Z self = 2025-05-07T20:33:07.5329343Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.5329970Z 2025-05-07T20:33:07.5330066Z @given( 2025-05-07T20:33:07.5330310Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5330634Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5330955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5331312Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5331644Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5331947Z ) 2025-05-07T20:33:07.5332305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5332771Z def test_silu_mul_quant( 2025-05-07T20:33:07.5333012Z self, 2025-05-07T20:33:07.5333206Z T: int, 2025-05-07T20:33:07.5333404Z D: int, 2025-05-07T20:33:07.5333622Z scale_ub: Optional[float], 2025-05-07T20:33:07.5333899Z contiguous: bool, 2025-05-07T20:33:07.5334142Z compiled: bool, 2025-05-07T20:33:07.5334362Z ) -> None: 2025-05-07T20:33:07.5334586Z torch.manual_seed(2025) 2025-05-07T20:33:07.5334837Z 2025-05-07T20:33:07.5335103Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5335459Z 2025-05-07T20:33:07.5335654Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5336089Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5336432Z x = x_sign * x_clamp 2025-05-07T20:33:07.5336745Z x0 = x[:, :D] 2025-05-07T20:33:07.5336965Z x1 = x[:, D:] 2025-05-07T20:33:07.5337182Z 2025-05-07T20:33:07.5337364Z if contiguous: 2025-05-07T20:33:07.5337601Z x0 = x0.contiguous() 2025-05-07T20:33:07.5337872Z x1 = x1.contiguous() 2025-05-07T20:33:07.5338123Z 2025-05-07T20:33:07.5338316Z if scale_ub is not None: 2025-05-07T20:33:07.5338684Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5339031Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5339345Z ) 2025-05-07T20:33:07.5339542Z else: 2025-05-07T20:33:07.5339753Z scale_ub_tensor = None 2025-05-07T20:33:07.5340007Z 2025-05-07T20:33:07.5340242Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5340567Z op = silu_mul_quant 2025-05-07T20:33:07.5340816Z if compiled: 2025-05-07T20:33:07.5341067Z op = torch.compile(op) 2025-05-07T20:33:07.5341373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5341652Z 2025-05-07T20:33:07.5341841Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5342008Z 2025-05-07T20:33:07.5342110Z moe/activation_test.py:117: 2025-05-07T20:33:07.5342416Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5342812Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5343099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5343844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5344581Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5345150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5345882Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5346592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5347155Z kernel = self.compile( 2025-05-07T20:33:07.5347731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5348431Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5348839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5349087Z 2025-05-07T20:33:07.5349347Z self = 2025-05-07T20:33:07.5350776Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5352307Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b79c5ca0>} 2025-05-07T20:33:07.5353782Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5354884Z context = 2025-05-07T20:33:07.5355199Z 2025-05-07T20:33:07.5355369Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5355926Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5356420Z module_map=module_map) 2025-05-07T20:33:07.5356788Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5357151Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5357421Z E ^ 2025-05-07T20:33:07.5357957Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5358496Z 2025-05-07T20:33:07.5358944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5359512Z 2025-05-07T20:33:07.5359616Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5360082Z self=, 2025-05-07T20:33:07.5360494Z T=2048, 2025-05-07T20:33:07.5360686Z D=7168, 2025-05-07T20:33:07.5360881Z scale_ub=None, 2025-05-07T20:33:07.5361097Z contiguous=False, 2025-05-07T20:33:07.5361325Z compiled=False, 2025-05-07T20:33:07.5361536Z ) 2025-05-07T20:33:07.5361859Z self = 2025-05-07T20:33:07.5362378Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.5362669Z 2025-05-07T20:33:07.5362754Z @given( 2025-05-07T20:33:07.5362990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5363308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5363629Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5363970Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5364305Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5364602Z ) 2025-05-07T20:33:07.5364966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5365421Z def test_silu_mul_quant( 2025-05-07T20:33:07.5365670Z self, 2025-05-07T20:33:07.5365863Z T: int, 2025-05-07T20:33:07.5366053Z D: int, 2025-05-07T20:33:07.5366268Z scale_ub: Optional[float], 2025-05-07T20:33:07.5366544Z contiguous: bool, 2025-05-07T20:33:07.5366779Z compiled: bool, 2025-05-07T20:33:07.5367001Z ) -> None: 2025-05-07T20:33:07.5367220Z torch.manual_seed(2025) 2025-05-07T20:33:07.5367468Z 2025-05-07T20:33:07.5367737Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5370033Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.5372093Z 2025-05-07T20:33:07.5372211Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.5372433Z 2025-05-07T20:33:07.5372542Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5373018Z self=, 2025-05-07T20:33:07.5373442Z T=128, 2025-05-07T20:33:07.5373628Z D=7168, 2025-05-07T20:33:07.5373817Z scale_ub=1200.0, 2025-05-07T20:33:07.5374035Z contiguous=True, 2025-05-07T20:33:07.5374254Z compiled=True, 2025-05-07T20:33:07.5374457Z ) 2025-05-07T20:33:07.5825389Z self = 2025-05-07T20:33:07.5826197Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.5826580Z 2025-05-07T20:33:07.5826684Z @given( 2025-05-07T20:33:07.5826999Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5827351Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5827667Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5828008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5828347Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5828639Z ) 2025-05-07T20:33:07.5829188Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5829722Z def test_silu_mul_quant( 2025-05-07T20:33:07.5830093Z self, 2025-05-07T20:33:07.5830285Z T: int, 2025-05-07T20:33:07.5830479Z D: int, 2025-05-07T20:33:07.5830700Z scale_ub: Optional[float], 2025-05-07T20:33:07.5830970Z contiguous: bool, 2025-05-07T20:33:07.5831310Z compiled: bool, 2025-05-07T20:33:07.5831543Z ) -> None: 2025-05-07T20:33:07.5831763Z torch.manual_seed(2025) 2025-05-07T20:33:07.5832018Z 2025-05-07T20:33:07.5832305Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5832660Z 2025-05-07T20:33:07.5832860Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5833161Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5833486Z x = x_sign * x_clamp 2025-05-07T20:33:07.5833738Z x0 = x[:, :D] 2025-05-07T20:33:07.5833965Z x1 = x[:, D:] 2025-05-07T20:33:07.5834173Z 2025-05-07T20:33:07.5834364Z if contiguous: 2025-05-07T20:33:07.5834604Z x0 = x0.contiguous() 2025-05-07T20:33:07.5834866Z x1 = x1.contiguous() 2025-05-07T20:33:07.5835119Z 2025-05-07T20:33:07.5835317Z if scale_ub is not None: 2025-05-07T20:33:07.5835596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5835939Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5836261Z ) 2025-05-07T20:33:07.5836455Z else: 2025-05-07T20:33:07.5836660Z scale_ub_tensor = None 2025-05-07T20:33:07.5836919Z 2025-05-07T20:33:07.5837152Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5837469Z op = silu_mul_quant 2025-05-07T20:33:07.5837722Z if compiled: 2025-05-07T20:33:07.5837969Z op = torch.compile(op) 2025-05-07T20:33:07.5838268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5838551Z 2025-05-07T20:33:07.5838742Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5838911Z 2025-05-07T20:33:07.5839007Z moe/activation_test.py:117: 2025-05-07T20:33:07.5839310Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5839658Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5839947Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5840532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.5841131Z return fn(*args, **kwargs) 2025-05-07T20:33:07.5841929Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5842670Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5843236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5843975Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5844686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5845248Z kernel = self.compile( 2025-05-07T20:33:07.5845820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5846519Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5846927Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5847176Z 2025-05-07T20:33:07.5847390Z self = 2025-05-07T20:33:07.5848607Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5850122Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b793e0d0>} 2025-05-07T20:33:07.5851633Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5852780Z context = 2025-05-07T20:33:07.5853091Z 2025-05-07T20:33:07.5853262Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5853812Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5854308Z module_map=module_map) 2025-05-07T20:33:07.5854683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5855048Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5855315Z E ^ 2025-05-07T20:33:07.5855803Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5856295Z 2025-05-07T20:33:07.5856740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5857298Z 2025-05-07T20:33:07.5857403Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5857828Z self=, 2025-05-07T20:33:07.5858244Z T=128, 2025-05-07T20:33:07.5858434Z D=7168, 2025-05-07T20:33:07.5858629Z scale_ub=1200.0, 2025-05-07T20:33:07.5858840Z contiguous=True, 2025-05-07T20:33:07.5859063Z compiled=False, 2025-05-07T20:33:07.5859266Z ) 2025-05-07T20:33:07.5859585Z self = 2025-05-07T20:33:07.5860108Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.5860397Z 2025-05-07T20:33:07.5860478Z @given( 2025-05-07T20:33:07.5860708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5861023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5861338Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5861678Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5862011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5862302Z ) 2025-05-07T20:33:07.5862661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5863196Z def test_silu_mul_quant( 2025-05-07T20:33:07.5863443Z self, 2025-05-07T20:33:07.5863637Z T: int, 2025-05-07T20:33:07.5863826Z D: int, 2025-05-07T20:33:07.5864043Z scale_ub: Optional[float], 2025-05-07T20:33:07.5864318Z contiguous: bool, 2025-05-07T20:33:07.5864551Z compiled: bool, 2025-05-07T20:33:07.5864775Z ) -> None: 2025-05-07T20:33:07.5873213Z torch.manual_seed(2025) 2025-05-07T20:33:07.5873504Z 2025-05-07T20:33:07.5873797Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5874164Z 2025-05-07T20:33:07.5874363Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5874658Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5876866Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.5879040Z 2025-05-07T20:33:07.5879166Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.5879439Z 2025-05-07T20:33:07.5879554Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5879985Z self=, 2025-05-07T20:33:07.5880419Z T=128, 2025-05-07T20:33:07.5880615Z D=5120, 2025-05-07T20:33:07.5880815Z scale_ub=1200.0, 2025-05-07T20:33:07.5881083Z contiguous=True, 2025-05-07T20:33:07.5881311Z compiled=True, 2025-05-07T20:33:07.5881518Z ) 2025-05-07T20:33:07.5881840Z self = 2025-05-07T20:33:07.5882365Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.5882647Z 2025-05-07T20:33:07.5883098Z @given( 2025-05-07T20:33:07.5883336Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5883662Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5883984Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5884315Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5884658Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5884946Z ) 2025-05-07T20:33:07.5885314Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5885778Z def test_silu_mul_quant( 2025-05-07T20:33:07.5886030Z self, 2025-05-07T20:33:07.5886231Z T: int, 2025-05-07T20:33:07.5886427Z D: int, 2025-05-07T20:33:07.5886650Z scale_ub: Optional[float], 2025-05-07T20:33:07.5886932Z contiguous: bool, 2025-05-07T20:33:07.5887171Z compiled: bool, 2025-05-07T20:33:07.5887402Z ) -> None: 2025-05-07T20:33:07.5887617Z torch.manual_seed(2025) 2025-05-07T20:33:07.5887863Z 2025-05-07T20:33:07.5888134Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5888492Z 2025-05-07T20:33:07.5888686Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5888975Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5891281Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.5893379Z 2025-05-07T20:33:07.5893495Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.5893712Z 2025-05-07T20:33:07.5893820Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5894242Z self=, 2025-05-07T20:33:07.5894657Z T=128, 2025-05-07T20:33:07.5894842Z D=7168, 2025-05-07T20:33:07.5895032Z scale_ub=None, 2025-05-07T20:33:07.5895238Z contiguous=True, 2025-05-07T20:33:07.5895455Z compiled=True, 2025-05-07T20:33:07.5895652Z ) 2025-05-07T20:33:07.7945735Z self = 2025-05-07T20:33:07.7946481Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.7946887Z 2025-05-07T20:33:07.7946979Z @given( 2025-05-07T20:33:07.7947221Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.7947555Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.7947873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.7948208Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.7948549Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.7948843Z ) 2025-05-07T20:33:07.7949422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.7950127Z def test_silu_mul_quant( 2025-05-07T20:33:07.7950374Z self, 2025-05-07T20:33:07.7950564Z T: int, 2025-05-07T20:33:07.7950777Z D: int, 2025-05-07T20:33:07.7950999Z scale_ub: Optional[float], 2025-05-07T20:33:07.7951272Z contiguous: bool, 2025-05-07T20:33:07.7951516Z compiled: bool, 2025-05-07T20:33:07.7951839Z ) -> None: 2025-05-07T20:33:07.7952067Z torch.manual_seed(2025) 2025-05-07T20:33:07.7952316Z 2025-05-07T20:33:07.7952594Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.7954847Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:07.7956913Z 
2025-05-07T20:33:07.7957041Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:07.7957260Z 
2025-05-07T20:33:07.7967260Z FAILED
2025-05-07T20:33:07.7967552Z 
2025-05-07T20:33:07.7967928Z =================================== FAILURES ===================================
2025-05-07T20:33:07.7968567Z _____________________ ActivationTests.test_silu_mul_quant ______________________
2025-05-07T20:33:07.7969260Z + Exception Group Traceback (most recent call last):
2025-05-07T20:33:07.7970115Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor
2025-05-07T20:33:07.7970902Z | yield
2025-05-07T20:33:07.7971505Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run
2025-05-07T20:33:07.7972251Z | self._callTestMethod(testMethod)
2025-05-07T20:33:07.7973088Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod
2025-05-07T20:33:07.7973854Z | method()
2025-05-07T20:33:07.7974769Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
2025-05-07T20:33:07.7975838Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:07.7976883Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test
2025-05-07T20:33:07.7977786Z | raise the_error_hypothesis_found
2025-05-07T20:33:07.7978478Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
2025-05-07T20:33:07.7979149Z +-+---------------- 1 ----------------
2025-05-07T20:33:07.7979560Z | Traceback (most recent call last):
2025-05-07T20:33:07.7980586Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
2025-05-07T20:33:07.7981704Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:07.7984942Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:07.7987854Z | Falsifying example: test_silu_mul_quant(
2025-05-07T20:33:07.7988566Z | self=,
2025-05-07T20:33:07.7989210Z | T=2048,
2025-05-07T20:33:07.7989529Z | D=5120, # or any other generated value
2025-05-07T20:33:07.7990217Z | scale_ub=None, # or any other generated value
2025-05-07T20:33:07.7990721Z | contiguous=True, # or any other generated value
2025-05-07T20:33:07.7991236Z | compiled=False, # or any other generated value
2025-05-07T20:33:07.7991749Z | )
2025-05-07T20:33:07.7991988Z |
2025-05-07T20:33:07.7992732Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
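
As hypothesis's note above says, each falsifying example can be replayed deterministically. A minimal sketch of how sub-failure 1 would be pinned locally, assuming a standalone variant of the test with the decorator stacked on top of the existing @given/@settings stack; the version string and payload are copied verbatim from the log, and the decorator is temporary, to be removed once the failure is fixed:

    from hypothesis import given, reproduce_failure, settings, strategies as st

    # Pin hypothesis to the falsifying example from sub-failure 1 above;
    # @reproduce_failure bypasses normal generation and the example database.
    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled):
        ...  # body unchanged from the test shown in the log
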
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.7987854Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.7988566Z | self=, 2025-05-07T20:33:07.7989210Z | T=2048, 2025-05-07T20:33:07.7989529Z | D=5120, # or any other generated value 2025-05-07T20:33:07.7990217Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:07.7990721Z | contiguous=True, # or any other generated value 2025-05-07T20:33:07.7991236Z | compiled=False, # or any other generated value 2025-05-07T20:33:07.7991749Z | ) 2025-05-07T20:33:07.7991988Z | 2025-05-07T20:33:07.7992732Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:07.7993607Z +---------------- 2 ---------------- 2025-05-07T20:33:07.7994012Z | Traceback (most recent call last): 2025-05-07T20:33:07.7995041Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:07.7996159Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.7999159Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.8002060Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.8002685Z | self=, 2025-05-07T20:33:07.8003266Z | T=128, 2025-05-07T20:33:07.8003555Z | D=7168, 2025-05-07T20:33:07.8003816Z | scale_ub=None, 2025-05-07T20:33:07.8004064Z | contiguous=True, 2025-05-07T20:33:07.8004317Z | compiled=True, 2025-05-07T20:33:07.8004539Z | ) 2025-05-07T20:33:07.8004727Z | 2025-05-07T20:33:07.8005285Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:07.8005930Z +---------------- 3 ---------------- 2025-05-07T20:33:07.8006239Z | Traceback (most recent call last): 2025-05-07T20:33:07.8007085Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:07.8007931Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8010172Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
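Each sub-failure above comes with a Hypothesis replay blob. The workflow the message describes is to pin the decorator on top of the existing test, re-run just that falsifying example, and delete the decorator once the bug is fixed; a sketch against this test's own signature (the version string and blob are copied verbatim from failure 1 above, and the test body is elided):

    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st

    @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # temporary: replays T=2048, D=5120, ...
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...  # unchanged test body from moe/activation_test.py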
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.8012334Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.8012799Z | self=, 2025-05-07T20:33:07.8013229Z | T=128, 2025-05-07T20:33:07.8013448Z | D=5120, 2025-05-07T20:33:07.8013665Z | scale_ub=1200.0, 2025-05-07T20:33:07.8013920Z | contiguous=True, 2025-05-07T20:33:07.8014162Z | compiled=True, 2025-05-07T20:33:07.8014399Z | ) 2025-05-07T20:33:07.8014583Z | 2025-05-07T20:33:07.8015184Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:07.8015876Z +---------------- 4 ---------------- 2025-05-07T20:33:07.8016180Z | Traceback (most recent call last): 2025-05-07T20:33:07.8016939Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:07.8017865Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:07.8018913Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:07.8019946Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8021175Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:07.8022382Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.8023300Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:07.8024376Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8025468Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:07.8026622Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8027833Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:33:07.8029040Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8030334Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:07.8031381Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.8032349Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:07.8033205Z | fn() 2025-05-07T20:33:07.8034048Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:07.8035001Z | self.fn.run( 2025-05-07T20:33:07.8035926Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:07.8036790Z | kernel = self.compile( 2025-05-07T20:33:07.8037696Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:07.8038758Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8039823Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:07.8040965Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8041723Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8042234Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.8042616Z | ^ 2025-05-07T20:33:07.8043308Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8044169Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.8044761Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:07.8045517Z | self=, 2025-05-07T20:33:07.8046166Z | T=1, # or any other generated value 2025-05-07T20:33:07.8046703Z | D=5120, # or any other generated value 2025-05-07T20:33:07.8047238Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:07.8047774Z | contiguous=True, # or any other generated value 2025-05-07T20:33:07.8048313Z | compiled=True, # or any other generated value 2025-05-07T20:33:07.8048762Z | ) 2025-05-07T20:33:07.8049020Z | 2025-05-07T20:33:07.8049766Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:07.8050736Z +------------------------------------ 2025-05-07T20:33:07.8051260Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:07.8051816Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8052422Z self=, 2025-05-07T20:33:07.8053067Z T=1, 2025-05-07T20:33:07.8053335Z D=5120, 2025-05-07T20:33:07.8053614Z scale_ub=None, 2025-05-07T20:33:07.8053911Z contiguous=True, 2025-05-07T20:33:07.8054220Z compiled=True, 2025-05-07T20:33:07.8054512Z ) 2025-05-07T20:33:07.8054961Z self = 2025-05-07T20:33:07.8055674Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.8056066Z 2025-05-07T20:33:07.8056178Z @given( 2025-05-07T20:33:07.8056509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8056960Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8057408Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8057896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8058373Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8058797Z ) 2025-05-07T20:33:07.8059311Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8059974Z def test_silu_mul_quant( 2025-05-07T20:33:07.8060324Z self, 2025-05-07T20:33:07.8060606Z T: int, 2025-05-07T20:33:07.8060893Z D: int, 2025-05-07T20:33:07.8061198Z scale_ub: Optional[float], 2025-05-07T20:33:07.8061591Z contiguous: bool, 2025-05-07T20:33:07.8061939Z compiled: bool, 2025-05-07T20:33:07.8062252Z ) -> None: 2025-05-07T20:33:07.8062558Z torch.manual_seed(2025) 2025-05-07T20:33:07.8062920Z 2025-05-07T20:33:07.8063307Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8063811Z 2025-05-07T20:33:07.8064091Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8064561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8065013Z x = x_sign * x_clamp 2025-05-07T20:33:07.8065358Z x0 = x[:, :D] 2025-05-07T20:33:07.8065652Z x1 = x[:, D:] 2025-05-07T20:33:07.8065954Z 2025-05-07T20:33:07.8066218Z if contiguous: 2025-05-07T20:33:07.8066544Z x0 = x0.contiguous() 
2025-05-07T20:33:07.8066918Z x1 = x1.contiguous() 2025-05-07T20:33:07.8067270Z 2025-05-07T20:33:07.8067537Z if scale_ub is not None: 2025-05-07T20:33:07.8067931Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8068409Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8068861Z ) 2025-05-07T20:33:07.8069132Z else: 2025-05-07T20:33:07.8069444Z scale_ub_tensor = None 2025-05-07T20:33:07.8069960Z 2025-05-07T20:33:07.8070291Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8070752Z op = silu_mul_quant 2025-05-07T20:33:07.8071121Z if compiled: 2025-05-07T20:33:07.8071477Z op = torch.compile(op) 2025-05-07T20:33:07.8071915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8072321Z 2025-05-07T20:33:07.8072591Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.8073060Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.8073494Z 2025-05-07T20:33:07.8073820Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8074354Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.8074787Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.8075244Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.8075761Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8076254Z 2025-05-07T20:33:07.8076540Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:07.8076832Z 2025-05-07T20:33:07.8076974Z moe/activation_test.py:126: 2025-05-07T20:33:07.8077418Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8077891Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.8078337Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8079500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.8080631Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.8081424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8082434Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8083732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.8084829Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8085966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.8087086Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8088194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.8089158Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.8090056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.8090792Z fn() 2025-05-07T20:33:07.8091527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.8092410Z self.fn.run( 2025-05-07T20:33:07.8093070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8093999Z kernel = self.compile( 2025-05-07T20:33:07.8094778Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8095709Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8096269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8096606Z 2025-05-07T20:33:07.8096898Z self = 2025-05-07T20:33:07.8098500Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8100642Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bbfdc9d0>} 2025-05-07T20:33:07.8102704Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8104249Z context = 2025-05-07T20:33:07.8104694Z 2025-05-07T20:33:07.8105016Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8105871Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8106542Z module_map=module_map) 2025-05-07T20:33:07.8107054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8107551Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.8108008Z E ^ 2025-05-07T20:33:07.8108692Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8109374Z 2025-05-07T20:33:07.8110105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8110872Z 2025-05-07T20:33:07.8111027Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8111599Z self=, 2025-05-07T20:33:07.8112189Z T=2048, 2025-05-07T20:33:07.8112459Z D=5120, 2025-05-07T20:33:07.8112722Z scale_ub=1200.0, 2025-05-07T20:33:07.8113014Z contiguous=True, 2025-05-07T20:33:07.8113309Z compiled=False, 2025-05-07T20:33:07.8113584Z ) 2025-05-07T20:33:07.8114041Z self = 2025-05-07T20:33:07.8114771Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.8115145Z 2025-05-07T20:33:07.8115259Z @given( 2025-05-07T20:33:07.8115575Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8116034Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8116481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8116956Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8117438Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8117862Z ) 2025-05-07T20:33:07.8118371Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8119052Z def test_silu_mul_quant( 2025-05-07T20:33:07.8119385Z self, 2025-05-07T20:33:07.8119646Z T: int, 2025-05-07T20:33:07.8119912Z D: int, 2025-05-07T20:33:07.8120216Z scale_ub: Optional[float], 2025-05-07T20:33:07.8120591Z contiguous: bool, 2025-05-07T20:33:07.8120923Z compiled: bool, 2025-05-07T20:33:07.8121252Z ) -> None: 2025-05-07T20:33:07.8121567Z torch.manual_seed(2025) 2025-05-07T20:33:07.8121920Z 2025-05-07T20:33:07.8122313Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8122929Z 2025-05-07T20:33:07.8123219Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8123636Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8124091Z x = x_sign * x_clamp 2025-05-07T20:33:07.8124436Z x0 = x[:, :D] 
2025-05-07T20:33:07.8124746Z x1 = x[:, D:] 2025-05-07T20:33:07.8125053Z 2025-05-07T20:33:07.8125312Z if contiguous: 2025-05-07T20:33:07.8125648Z x0 = x0.contiguous() 2025-05-07T20:33:07.8126024Z x1 = x1.contiguous() 2025-05-07T20:33:07.8126367Z 2025-05-07T20:33:07.8126634Z if scale_ub is not None: 2025-05-07T20:33:07.8127002Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8127451Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8127877Z ) 2025-05-07T20:33:07.8128155Z else: 2025-05-07T20:33:07.8147495Z scale_ub_tensor = None 2025-05-07T20:33:07.8147846Z 2025-05-07T20:33:07.8148179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8148658Z op = silu_mul_quant 2025-05-07T20:33:07.8149022Z if compiled: 2025-05-07T20:33:07.8149390Z op = torch.compile(op) 2025-05-07T20:33:07.8149950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8150372Z 2025-05-07T20:33:07.8150764Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8151016Z 2025-05-07T20:33:07.8151249Z moe/activation_test.py:117: 2025-05-07T20:33:07.8151686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8152171Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8152588Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8153636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8154692Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8155447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8156421Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8157369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8158123Z kernel = self.compile( 2025-05-07T20:33:07.8158907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8159887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8160473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8160816Z 2025-05-07T20:33:07.8161112Z self = 2025-05-07T20:33:07.8162808Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8164885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bb06cdc0>} 2025-05-07T20:33:07.8166931Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8168477Z context = 2025-05-07T20:33:07.8168914Z 2025-05-07T20:33:07.8169152Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8169878Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8170543Z module_map=module_map) 2025-05-07T20:33:07.8171103Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8171595Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8171954Z E ^ 2025-05-07T20:33:07.8172606Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8173319Z 2025-05-07T20:33:07.8173920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8174670Z 2025-05-07T20:33:07.8174811Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8175386Z self=, 2025-05-07T20:33:07.8175944Z T=2048, 2025-05-07T20:33:07.8176201Z D=5120, 2025-05-07T20:33:07.8176475Z scale_ub=1200.0, 2025-05-07T20:33:07.8176780Z contiguous=True, 2025-05-07T20:33:07.8177076Z compiled=True, 2025-05-07T20:33:07.8177359Z ) 2025-05-07T20:33:07.8177816Z self = 2025-05-07T20:33:07.8178473Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.8178882Z 2025-05-07T20:33:07.8178990Z @given( 2025-05-07T20:33:07.8179303Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8179823Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8180265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8180794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8181277Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8181679Z ) 2025-05-07T20:33:07.8182088Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8182557Z def test_silu_mul_quant( 2025-05-07T20:33:07.8183183Z self, 2025-05-07T20:33:07.8183385Z T: int, 2025-05-07T20:33:07.8183582Z D: int, 2025-05-07T20:33:07.8183794Z scale_ub: Optional[float], 2025-05-07T20:33:07.8184076Z contiguous: bool, 2025-05-07T20:33:07.8184321Z compiled: bool, 2025-05-07T20:33:07.8184546Z ) -> None: 2025-05-07T20:33:07.8184757Z torch.manual_seed(2025) 2025-05-07T20:33:07.8185005Z 2025-05-07T20:33:07.8185280Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8185634Z 2025-05-07T20:33:07.8185823Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8186123Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8186437Z x = x_sign * x_clamp 2025-05-07T20:33:07.8186679Z x0 = x[:, :D] 2025-05-07T20:33:07.8186897Z x1 = x[:, D:] 2025-05-07T20:33:07.8187098Z 2025-05-07T20:33:07.8187281Z if contiguous: 2025-05-07T20:33:07.8187511Z x0 = x0.contiguous() 2025-05-07T20:33:07.8187768Z x1 = x1.contiguous() 2025-05-07T20:33:07.8188014Z 2025-05-07T20:33:07.8188208Z if scale_ub is not None: 2025-05-07T20:33:07.8188483Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8188826Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8189147Z ) 2025-05-07T20:33:07.8189334Z else: 2025-05-07T20:33:07.8189551Z scale_ub_tensor = None 2025-05-07T20:33:07.8189939Z 2025-05-07T20:33:07.8190168Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8190493Z op = silu_mul_quant 2025-05-07T20:33:07.8190746Z if compiled: 2025-05-07T20:33:07.8190994Z op = torch.compile(op) 2025-05-07T20:33:07.8191291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8191571Z 2025-05-07T20:33:07.8191762Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.8192042Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.8192345Z 2025-05-07T20:33:07.8192582Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8192918Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.8193376Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.8193702Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.8194072Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8194384Z 2025-05-07T20:33:07.8194581Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:07.8194789Z 2025-05-07T20:33:07.8194889Z moe/activation_test.py:126: 2025-05-07T20:33:07.8195188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8195534Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.8195868Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8196710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.8197522Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.8198103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8198832Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8199564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.8200403Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8201264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.8202064Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8202837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.8203583Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.8204220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.8204763Z fn() 2025-05-07T20:33:07.8205295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.8205914Z self.fn.run( 2025-05-07T20:33:07.8206401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8206960Z kernel = self.compile( 2025-05-07T20:33:07.8207525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8208216Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8208620Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8208870Z 2025-05-07T20:33:07.8209083Z self = 2025-05-07T20:33:07.8210250Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8211755Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1baa53550>} 2025-05-07T20:33:07.8213226Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8214330Z context = 2025-05-07T20:33:07.8214634Z 2025-05-07T20:33:07.8214801Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8215348Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8215881Z module_map=module_map) 2025-05-07T20:33:07.8216251Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8216609Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.8216877Z E ^ 2025-05-07T20:33:07.8217361Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8217857Z 2025-05-07T20:33:07.8218302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8219506Z 2025-05-07T20:33:07.8219609Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8220031Z self=, 2025-05-07T20:33:07.8220450Z T=16384, 2025-05-07T20:33:07.8220643Z D=7168, 2025-05-07T20:33:07.8220832Z scale_ub=1200.0, 2025-05-07T20:33:07.8221049Z contiguous=False, 2025-05-07T20:33:07.8221277Z compiled=False, 2025-05-07T20:33:07.8221479Z ) 2025-05-07T20:33:07.8221794Z self = 2025-05-07T20:33:07.8222315Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.8222635Z 2025-05-07T20:33:07.8222720Z @given( 2025-05-07T20:33:07.8223017Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8223368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8223682Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8224017Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8224344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8224635Z ) 2025-05-07T20:33:07.8224992Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8225490Z def test_silu_mul_quant( 2025-05-07T20:33:07.8225729Z self, 2025-05-07T20:33:07.8225916Z T: int, 2025-05-07T20:33:07.8226110Z D: int, 2025-05-07T20:33:07.8226324Z scale_ub: Optional[float], 2025-05-07T20:33:07.8226596Z contiguous: bool, 2025-05-07T20:33:07.8226829Z compiled: bool, 2025-05-07T20:33:07.8227043Z ) -> None: 2025-05-07T20:33:07.8227256Z torch.manual_seed(2025) 2025-05-07T20:33:07.8227493Z 2025-05-07T20:33:07.8227758Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8228112Z 2025-05-07T20:33:07.8228297Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8228583Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8228900Z x = x_sign * x_clamp 2025-05-07T20:33:07.8229139Z x0 = x[:, :D] 2025-05-07T20:33:07.8229350Z x1 = x[:, D:] 2025-05-07T20:33:07.8229565Z 2025-05-07T20:33:07.8229750Z if contiguous: 2025-05-07T20:33:07.8230079Z x0 = x0.contiguous() 2025-05-07T20:33:07.8230342Z x1 = x1.contiguous() 2025-05-07T20:33:07.8230587Z 2025-05-07T20:33:07.8230775Z if scale_ub is not None: 2025-05-07T20:33:07.8231053Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8231395Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8231715Z ) 2025-05-07T20:33:07.8231900Z else: 2025-05-07T20:33:07.8232114Z scale_ub_tensor = None 2025-05-07T20:33:07.8232372Z 2025-05-07T20:33:07.8232599Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8232981Z op = silu_mul_quant 2025-05-07T20:33:07.8233230Z if compiled: 
2025-05-07T20:33:07.8233479Z op = torch.compile(op) 2025-05-07T20:33:07.8233784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8234060Z 2025-05-07T20:33:07.8234255Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8234423Z 2025-05-07T20:33:07.8234525Z moe/activation_test.py:117: 2025-05-07T20:33:07.8234871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8235219Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8235504Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8236243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8236983Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8237551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8238289Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8238990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8239562Z kernel = self.compile( 2025-05-07T20:33:07.8240131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8240828Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8241236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8241483Z 2025-05-07T20:33:07.8241693Z self = 2025-05-07T20:33:07.8242929Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8244464Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1baa533a0>} 2025-05-07T20:33:07.8245931Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8247073Z context = 2025-05-07T20:33:07.8247384Z 2025-05-07T20:33:07.8247553Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8248104Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8248590Z module_map=module_map) 2025-05-07T20:33:07.8248965Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8249324Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8249588Z E ^ 2025-05-07T20:33:07.8250070Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8250566Z 2025-05-07T20:33:07.8251018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8251570Z 2025-05-07T20:33:07.8251678Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8252106Z self=, 2025-05-07T20:33:07.8252524Z T=1, 2025-05-07T20:33:07.8252708Z D=7168, 2025-05-07T20:33:07.8252921Z scale_ub=None, 2025-05-07T20:33:07.8253153Z contiguous=True, 2025-05-07T20:33:07.8253374Z compiled=True, 2025-05-07T20:33:07.8253577Z ) 2025-05-07T20:33:07.8253893Z self = 2025-05-07T20:33:07.8254401Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.8254670Z 2025-05-07T20:33:07.8254756Z @given( 2025-05-07T20:33:07.8254978Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8255304Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8255621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8255958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8256340Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8256639Z ) 2025-05-07T20:33:07.8257003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8257461Z def test_silu_mul_quant( 2025-05-07T20:33:07.8257710Z self, 2025-05-07T20:33:07.8257900Z T: int, 2025-05-07T20:33:07.8258093Z D: int, 2025-05-07T20:33:07.8258312Z scale_ub: Optional[float], 2025-05-07T20:33:07.8258592Z contiguous: bool, 2025-05-07T20:33:07.8258824Z compiled: bool, 2025-05-07T20:33:07.8259048Z ) -> None: 2025-05-07T20:33:07.8259264Z torch.manual_seed(2025) 2025-05-07T20:33:07.8259505Z 2025-05-07T20:33:07.8259782Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8260142Z 2025-05-07T20:33:07.8260328Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8260628Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8260948Z x = x_sign * x_clamp 2025-05-07T20:33:07.8261196Z x0 = x[:, :D] 2025-05-07T20:33:07.8261409Z x1 = x[:, D:] 2025-05-07T20:33:07.8261618Z 2025-05-07T20:33:07.8261804Z if contiguous: 2025-05-07T20:33:07.8262031Z x0 = x0.contiguous() 2025-05-07T20:33:07.8262297Z x1 = x1.contiguous() 2025-05-07T20:33:07.8262542Z 2025-05-07T20:33:07.8262794Z if scale_ub is not None: 2025-05-07T20:33:07.8263095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8263499Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8263812Z ) 2025-05-07T20:33:07.8264006Z else: 2025-05-07T20:33:07.8264213Z scale_ub_tensor = None 2025-05-07T20:33:07.8264463Z 2025-05-07T20:33:07.8264690Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8265055Z op = silu_mul_quant 2025-05-07T20:33:07.8265303Z if compiled: 2025-05-07T20:33:07.8265554Z op = torch.compile(op) 2025-05-07T20:33:07.8265859Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8266137Z 2025-05-07T20:33:07.8266318Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.8266605Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.8266903Z 2025-05-07T20:33:07.8267134Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8267479Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.8267781Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.8268097Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.8268467Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8268789Z 2025-05-07T20:33:07.8268982Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:07.8269192Z 2025-05-07T20:33:07.8269293Z moe/activation_test.py:126: 2025-05-07T20:33:07.8269597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8270056Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.8270385Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8271231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.8272052Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.8272631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8273405Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8274144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.8274914Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8275716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.8276572Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8277362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.8278049Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.8278685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.8279239Z fn() 2025-05-07T20:33:07.8279779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.8280396Z self.fn.run( 2025-05-07T20:33:07.8280887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8281456Z kernel = self.compile( 2025-05-07T20:33:07.8282032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8282727Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8283393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8283640Z 2025-05-07T20:33:07.8283944Z self = 2025-05-07T20:33:07.8284804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8285417Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1ba9ff9d0>} 2025-05-07T20:33:07.8286296Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8286501Z context = 2025-05-07T20:33:07.8286505Z 2025-05-07T20:33:07.8286676Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8286956Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8287073Z module_map=module_map) 2025-05-07T20:33:07.8287235Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8287334Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.8287415Z E ^ 2025-05-07T20:33:07.8287802Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8287811Z 2025-05-07T20:33:07.8288264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8288270Z 2025-05-07T20:33:07.8288374Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8288606Z self=, 2025-05-07T20:33:07.8288691Z T=4096, 2025-05-07T20:33:07.8288765Z D=5120, 2025-05-07T20:33:07.8288847Z scale_ub=None, 2025-05-07T20:33:07.8288942Z contiguous=False, 2025-05-07T20:33:07.8289028Z compiled=False, 2025-05-07T20:33:07.8289106Z ) 2025-05-07T20:33:07.8289338Z self = 2025-05-07T20:33:07.8289518Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.8289522Z 2025-05-07T20:33:07.8289600Z @given( 2025-05-07T20:33:07.8289720Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8289821Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8289941Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8290119Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8290234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8290314Z ) 2025-05-07T20:33:07.8290571Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8290673Z def test_silu_mul_quant( 2025-05-07T20:33:07.8290755Z self, 2025-05-07T20:33:07.8290833Z T: int, 2025-05-07T20:33:07.8290918Z D: int, 2025-05-07T20:33:07.8291014Z scale_ub: Optional[float], 2025-05-07T20:33:07.8291101Z contiguous: bool, 2025-05-07T20:33:07.8291193Z compiled: bool, 2025-05-07T20:33:07.8291270Z ) -> None: 2025-05-07T20:33:07.8291365Z torch.manual_seed(2025) 2025-05-07T20:33:07.8291447Z 2025-05-07T20:33:07.8291619Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8291690Z 2025-05-07T20:33:07.8291787Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8291913Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8292010Z x = x_sign * x_clamp 2025-05-07T20:33:07.8292092Z x0 = x[:, :D] 2025-05-07T20:33:07.8292172Z x1 = x[:, D:] 2025-05-07T20:33:07.8292251Z 2025-05-07T20:33:07.8292334Z if contiguous: 2025-05-07T20:33:07.8292423Z x0 = x0.contiguous() 2025-05-07T20:33:07.8292567Z x1 = x1.contiguous() 2025-05-07T20:33:07.8292678Z 2025-05-07T20:33:07.8292772Z if scale_ub is not None: 2025-05-07T20:33:07.8292885Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8293020Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8293096Z ) 2025-05-07T20:33:07.8293181Z else: 2025-05-07T20:33:07.8293275Z scale_ub_tensor = None 2025-05-07T20:33:07.8293388Z 2025-05-07T20:33:07.8293527Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8293619Z op = silu_mul_quant 2025-05-07T20:33:07.8293715Z if compiled: 
2025-05-07T20:33:07.8293815Z op = torch.compile(op) 2025-05-07T20:33:07.8293920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8294001Z 2025-05-07T20:33:07.8294089Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8294094Z 2025-05-07T20:33:07.8294191Z moe/activation_test.py:117: 2025-05-07T20:33:07.8294336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8294442Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8294542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8295097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8295194Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8295589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8295825Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8296186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8296284Z kernel = self.compile( 2025-05-07T20:33:07.8296692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8296875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8297009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8297014Z 2025-05-07T20:33:07.8297224Z self = 2025-05-07T20:33:07.8298077Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8298886Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1ba86fe50>} 2025-05-07T20:33:07.8299838Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8300041Z context = 2025-05-07T20:33:07.8300046Z 2025-05-07T20:33:07.8300217Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8300501Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8300614Z module_map=module_map) 2025-05-07T20:33:07.8300786Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8300885Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8300969Z E ^ 2025-05-07T20:33:07.8301363Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8301368Z 2025-05-07T20:33:07.8301817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8301868Z 2025-05-07T20:33:07.8301981Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8302248Z self=, 2025-05-07T20:33:07.8302326Z T=4096, 2025-05-07T20:33:07.8302405Z D=7168, 2025-05-07T20:33:07.8302489Z scale_ub=None, 2025-05-07T20:33:07.8302576Z contiguous=False, 2025-05-07T20:33:07.8302683Z compiled=False, 2025-05-07T20:33:07.8302810Z ) 2025-05-07T20:33:07.8303052Z self = 2025-05-07T20:33:07.8303239Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.8303246Z 2025-05-07T20:33:07.8303323Z @given( 2025-05-07T20:33:07.8303447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8303544Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8303658Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8303783Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8303896Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8303973Z ) 2025-05-07T20:33:07.8304237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8304328Z def test_silu_mul_quant( 2025-05-07T20:33:07.8304401Z self, 2025-05-07T20:33:07.8304482Z T: int, 2025-05-07T20:33:07.8304559Z D: int, 2025-05-07T20:33:07.8304659Z scale_ub: Optional[float], 2025-05-07T20:33:07.8304758Z contiguous: bool, 2025-05-07T20:33:07.8304841Z compiled: bool, 2025-05-07T20:33:07.8304925Z ) -> None: 2025-05-07T20:33:07.8305020Z torch.manual_seed(2025) 2025-05-07T20:33:07.8305094Z 2025-05-07T20:33:07.8305270Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8305344Z 2025-05-07T20:33:07.8305437Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8305568Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8305656Z x = x_sign * x_clamp 2025-05-07T20:33:07.8305736Z x0 = x[:, :D] 2025-05-07T20:33:07.8305819Z x1 = x[:, D:] 2025-05-07T20:33:07.8305893Z 2025-05-07T20:33:07.8305977Z if contiguous: 2025-05-07T20:33:07.8306074Z x0 = x0.contiguous() 2025-05-07T20:33:07.8306162Z x1 = x1.contiguous() 2025-05-07T20:33:07.8306243Z 2025-05-07T20:33:07.8306341Z if scale_ub is not None: 2025-05-07T20:33:07.8306446Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8306587Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8306707Z ) 2025-05-07T20:33:07.8306784Z else: 2025-05-07T20:33:07.8306886Z scale_ub_tensor = None 2025-05-07T20:33:07.8314983Z 2025-05-07T20:33:07.8315151Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8315254Z op = silu_mul_quant 2025-05-07T20:33:07.8315351Z if compiled: 2025-05-07T20:33:07.8315453Z op = torch.compile(op) 2025-05-07T20:33:07.8315571Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8315644Z 2025-05-07T20:33:07.8315738Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8315755Z 2025-05-07T20:33:07.8315856Z moe/activation_test.py:117: 2025-05-07T20:33:07.8315993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8316110Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8316213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8316771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8316885Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8317276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8317591Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8317961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8318097Z kernel = self.compile( 2025-05-07T20:33:07.8318518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8318705Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8318877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8318882Z 2025-05-07T20:33:07.8319109Z self = 2025-05-07T20:33:07.8319971Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8320536Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1ba5a8a60>} 2025-05-07T20:33:07.8321359Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8321569Z context = 2025-05-07T20:33:07.8321576Z 2025-05-07T20:33:07.8321750Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8322036Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8322158Z module_map=module_map) 2025-05-07T20:33:07.8322327Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8322428Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8322514Z E ^ 2025-05-07T20:33:07.8322900Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:33:07.8323473Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7fd1ba5e2550>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:33:07.8340976Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
Same source listing as above; this time the error is raised from the op under test rather than the reference:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
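All of these failures share one root cause: both kernels quantize to torch.float8_e4m3fn, which Triton lowers to its fp8e4nv type, and NVIDIA GPUs below compute capability 8.9 (Ada/Hopper) cannot compile that type; on an Ampere-class runner GPU (such as an A10G, SM 8.6) only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. A minimal sketch of a capability guard that would skip these tests instead of failing them (hypothetical; supports_fp8e4nv and skip_unless_fp8e4nv are illustrative names, not FBGEMM's actual gating):

# Hypothetical capability gate (illustrative names, not FBGEMM's actual gating):
# skip fp8 tests on GPUs that cannot compile Triton's fp8e4nv / torch.float8_e4m3fn.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Ampere (SM 8.0/8.6) lacks native e4m3 support; Ada (8.9) and Hopper (9.0) have it.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


skip_unless_fp8e4nv = unittest.skipIf(
    not supports_fp8e4nv(),
    "fp8e4nv needs SM 8.9+; this GPU only supports fp8e4b15/fp8e5",
)

Decorating test_silu_mul_quant (or the whole test class) with skip_unless_fp8e4nv would turn this wall of identical CompilationErrors into a single skip on pre-Ada runners.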
Hypothesis keeps drawing examples; each one reprints the identical source listing and fails with the same fp8e4nv ValueError, differing only in the drawn parameters and in which kernel is compiled first:

2025-05-07T20:33:07.8354476Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> CompilationError from _fbgemm_silu_mul_quant via fn() (moe/activation_test.py:117)
2025-05-07T20:33:07.8367643Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> CompilationError from _kernel_quantize_fp8_row via ref_fn() (moe/activation_test.py:126)
2025-05-07T20:33:07.8385481Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> CompilationError from _kernel_quantize_fp8_row via ref_fn() (moe/activation_test.py:126)
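For orientation, the reference path that keeps failing, triton_quantize_fp8_row, performs per-row dynamic fp8 quantization; the test's own check, y = y_fp8.to(torch.float32) * y_scale[:, None], pins down the contract that y_scale is the per-row dequantization scale. A plain-PyTorch sketch of that contract (illustrative only: the real Triton kernel fuses this into one pass, and the exact scale_ub/epsilon handling here is an assumption):

# Illustrative pure-PyTorch model of row-wise fp8 quantization; the Triton kernel
# _kernel_quantize_fp8_row is the fused equivalent. Clamping/eps details are assumptions.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=1).to(torch.float32)  # per-row dynamic range
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)   # optional upper bound on the range
    row_max = torch.clamp(row_max, min=1e-12)        # avoid div-by-zero on all-zero rows
    y_scale = row_max / FP8_MAX                      # dequant scale: y ~ y_fp8 * y_scale[:, None]
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale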
The sweep continues in the same pattern:

2025-05-07T20:33:07.8402855Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> CompilationError from _kernel_quantize_fp8_row via ref_fn() (moe/activation_test.py:126)
2025-05-07T20:33:07.8420144Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> CompilationError from _kernel_quantize_fp8_row via ref_fn() (moe/activation_test.py:126)
2025-05-07T20:33:07.8437547Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> CompilationError from _kernel_quantize_fp8_row via ref_fn() (moe/activation_test.py:126)
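The next drawn example fails on the direct, torch.compile'd path. A minimal standalone repro would look like the sketch below (assumptions: the import path is read off the traceback, and fresh random inputs stand in for the test's constructed ones; any SM < 8.9 GPU should reproduce the error):

# Hypothetical standalone repro of the failing example (T=1, D=5120,
# scale_ub=1200.0, compiled=True); import path taken from the traceback.
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

x0 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
x1 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

op = torch.compile(silu_mul_quant)
y_fp8, y_scale = op(x0, x1, scale_ub)  # raises CompilationError on SM < 8.9 GPUs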
torch.tensor( 2025-05-07T20:33:07.8479326Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8479404Z ) 2025-05-07T20:33:07.8479479Z else: 2025-05-07T20:33:07.8479572Z scale_ub_tensor = None 2025-05-07T20:33:07.8479646Z 2025-05-07T20:33:07.8479778Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8479866Z op = silu_mul_quant 2025-05-07T20:33:07.8479953Z if compiled: 2025-05-07T20:33:07.8480099Z op = torch.compile(op) 2025-05-07T20:33:07.8480205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8480285Z 2025-05-07T20:33:07.8480373Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8480378Z 2025-05-07T20:33:07.8480480Z moe/activation_test.py:117: 2025-05-07T20:33:07.8480612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8480711Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8480818Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8481208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8481298Z return fn(*args, **kwargs) 2025-05-07T20:33:07.8481838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8481938Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8482328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8482560Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8483251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8483536Z kernel = self.compile( 2025-05-07T20:33:07.8483949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8484236Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8484371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8484376Z 2025-05-07T20:33:07.8484590Z self = 2025-05-07T20:33:07.8485519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8486071Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b90549d0>} 2025-05-07T20:33:07.8486895Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8487094Z context = 2025-05-07T20:33:07.8487098Z 2025-05-07T20:33:07.8487265Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8487551Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8487661Z module_map=module_map) 2025-05-07T20:33:07.8487832Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8487929Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8488005Z E ^ 2025-05-07T20:33:07.8488392Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8488397Z 2025-05-07T20:33:07.8488842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8488849Z 2025-05-07T20:33:07.8488951Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8489191Z self=, 2025-05-07T20:33:07.8489265Z T=1, 2025-05-07T20:33:07.8489346Z D=5120, 2025-05-07T20:33:07.8489430Z scale_ub=None, 2025-05-07T20:33:07.8489513Z contiguous=False, 2025-05-07T20:33:07.8489599Z compiled=True, 2025-05-07T20:33:07.8489671Z ) 2025-05-07T20:33:07.8489961Z self = 2025-05-07T20:33:07.8490140Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.8490145Z 2025-05-07T20:33:07.8490222Z @given( 2025-05-07T20:33:07.8490341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8490443Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8490561Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8490685Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8490796Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8490868Z ) 2025-05-07T20:33:07.8491133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8491223Z def test_silu_mul_quant( 2025-05-07T20:33:07.8491299Z self, 2025-05-07T20:33:07.8491381Z T: int, 2025-05-07T20:33:07.8491456Z D: int, 2025-05-07T20:33:07.8491554Z scale_ub: Optional[float], 2025-05-07T20:33:07.8491653Z contiguous: bool, 2025-05-07T20:33:07.8491738Z compiled: bool, 2025-05-07T20:33:07.8491817Z ) -> None: 2025-05-07T20:33:07.8491918Z torch.manual_seed(2025) 2025-05-07T20:33:07.8491989Z 2025-05-07T20:33:07.8492167Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8492240Z 2025-05-07T20:33:07.8492376Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8492546Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8492634Z x = x_sign * x_clamp 2025-05-07T20:33:07.8492713Z x0 = x[:, :D] 2025-05-07T20:33:07.8492801Z x1 = x[:, D:] 2025-05-07T20:33:07.8492875Z 2025-05-07T20:33:07.8492956Z if contiguous: 2025-05-07T20:33:07.8493052Z x0 = x0.contiguous() 2025-05-07T20:33:07.8493183Z x1 = x1.contiguous() 2025-05-07T20:33:07.8493257Z 2025-05-07T20:33:07.8493354Z if scale_ub is not None: 2025-05-07T20:33:07.8493457Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8493602Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8493676Z ) 2025-05-07T20:33:07.8493751Z else: 2025-05-07T20:33:07.8493853Z scale_ub_tensor = None 2025-05-07T20:33:07.8493923Z 2025-05-07T20:33:07.8494054Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8494150Z op = silu_mul_quant 2025-05-07T20:33:07.8494241Z if compiled: 2025-05-07T20:33:07.8494340Z op = torch.compile(op) 2025-05-07T20:33:07.8494452Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8494525Z 2025-05-07T20:33:07.8494615Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.8494743Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.8494819Z 2025-05-07T20:33:07.8494960Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8495062Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.8495163Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.8495289Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.8495429Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8495504Z 2025-05-07T20:33:07.8495606Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:07.8495615Z 2025-05-07T20:33:07.8495708Z moe/activation_test.py:126: 2025-05-07T20:33:07.8495840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8495949Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.8496089Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8496705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.8496807Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.8497236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8497474Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8497864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.8498139Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8498568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.8498831Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8499238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.8499408Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.8499775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.8499853Z fn() 2025-05-07T20:33:07.8500280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.8500365Z self.fn.run( 2025-05-07T20:33:07.8500762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8500918Z kernel = self.compile( 2025-05-07T20:33:07.8501330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8501505Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8501633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8501683Z 2025-05-07T20:33:07.8501895Z self = 2025-05-07T20:33:07.8502745Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8503303Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1b97ede50>} 2025-05-07T20:33:07.8504114Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8504315Z context = 2025-05-07T20:33:07.8504320Z 2025-05-07T20:33:07.8504490Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8504764Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8504879Z module_map=module_map) 2025-05-07T20:33:07.8505040Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8505143Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.8505214Z E ^ 2025-05-07T20:33:07.8505596Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8505604Z 2025-05-07T20:33:07.8506055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
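Every failure in this run bottoms out in the same lowering step: ast_to_ttir rejects the fp8e4nv element type. fp8e4nv is Triton's name for FP8 E4M3 on NVIDIA GPUs, and the NVIDIA backend only lowers that type on compute capability 8.9 or newer; a GPU that reports only fp8e4b15 and fp8e5, as the ValueError here does, is a pre-Ada part (the A10G, SM 8.6, behaves exactly this way). A minimal guard sketch that would make the suite skip rather than error on such GPUs; the helper supports_fp8e4nv is hypothetical and not part of fbgemm_gpu or this test file:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton's fp8e4nv (FP8 E4M3) only compiles for
    # NVIDIA GPUs with compute capability >= 8.9; SM 8.6 parts expose only
    # fp8e4b15 and fp8e5, which is precisely the ValueError in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Applied to the test above it would read:
# @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...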
Hypothesis continued sampling examples; each one reprinted the identical test body shown above and failed with the same triton.compiler.errors.CompilationError raised from /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100, wrapping ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the sampled parameters and the kernel that failed to compile varied: fn() dies in _fbgemm_silu_mul_quant[grid]( at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (reached through torch/_dynamo/eval_frame.py:678 when compiled=True), while ref_fn() dies in _kernel_quantize_fp8_row[grid]( at fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370.

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) -> _kernel_quantize_fp8_row, from ref_fn() (moe/activation_test.py:126)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
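The sweep above shows the failure is insensitive to T, D, scale_ub, contiguity, and torch.compile; every combination dies in the same make_ir/ast_to_ttir call, which points at the architecture rather than the inputs. A standalone repro sketch under that reading, reusing the reference-path entry point and module path from the traceback (the explicit None mirrors the test's scale_ub_tensor); treat it as a sketch, not a verified reproducer:

import torch

# Module path copied from the traceback (fp8_gemm.py:2370).
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
    triton_quantize_fp8_row,
)

# Shapes mirror the T=1, D=5120 example; any float32 CUDA matrix reaches the
# Triton JIT. On a GPU without FP8 E4M3 support this call should raise the
# same CompilationError ("type fp8e4nv not supported in this architecture").
y = torch.randn(1, 5120, device="cuda", dtype=torch.float32)
y_fp8, y_scale = triton_quantize_fp8_row(y, None)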
Hypothesis went on to try eleven more examples, and every one failed identically: the Triton compiler raised CompilationError at the first line of _fbgemm_silu_mul_quant with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The compiled=True runs differ only by an extra torch/_dynamo/eval_frame.py:678 frame (return fn(*args, **kwargs)) in the traceback. The examples tried:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
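Note on the failure: this is an architecture limitation, not a kernel bug. fp8e4nv is Triton's name for the float8_e4m3fn type, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper) onward; the A10G on a linux.g5.4xlarge runner is SM 8.6, so every compilation of _fbgemm_silu_mul_quant fails identically. Below is a minimal sketch of the kind of capability guard that would skip these cases instead of failing them, assuming unittest and a CUDA build of PyTorch (supports_fp8e4nv and the class name are illustrative, not taken from the test source):

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) kernels require compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Applied to the test class, unsupported runners would report a skip
    # rather than repeated identical CompilationErrors.
    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class ActivationTests(unittest.TestCase):
        ...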
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8817587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8817834Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8818204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8818303Z kernel = self.compile( 2025-05-07T20:33:07.8818724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8818908Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8819042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8819048Z 2025-05-07T20:33:07.8819270Z self = 2025-05-07T20:33:07.8820120Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8820712Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b87bb670>} 2025-05-07T20:33:07.8821561Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8821768Z context = 2025-05-07T20:33:07.8821773Z 2025-05-07T20:33:07.8821986Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8822264Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8822382Z module_map=module_map) 2025-05-07T20:33:07.8822545Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8822646Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8822737Z E ^ 2025-05-07T20:33:07.8823127Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8823134Z 2025-05-07T20:33:07.8823593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8823598Z 2025-05-07T20:33:07.8823705Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8823942Z self=, 2025-05-07T20:33:07.8824034Z T=2048, 2025-05-07T20:33:07.8824115Z D=7168, 2025-05-07T20:33:07.8824201Z scale_ub=None, 2025-05-07T20:33:07.8824297Z contiguous=False, 2025-05-07T20:33:07.8824386Z compiled=True, 2025-05-07T20:33:07.8824470Z ) 2025-05-07T20:33:07.8824700Z self = 2025-05-07T20:33:07.8824883Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.8824887Z 2025-05-07T20:33:07.8824975Z @given( 2025-05-07T20:33:07.8825101Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8825206Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8825336Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8825460Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8825577Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8825665Z ) 2025-05-07T20:33:07.8825929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8826034Z def test_silu_mul_quant( 2025-05-07T20:33:07.8826115Z self, 2025-05-07T20:33:07.8826197Z T: int, 2025-05-07T20:33:07.8826283Z D: int, 2025-05-07T20:33:07.8826430Z scale_ub: Optional[float], 2025-05-07T20:33:07.8826521Z contiguous: bool, 2025-05-07T20:33:07.8826613Z compiled: bool, 2025-05-07T20:33:07.8826692Z ) -> None: 2025-05-07T20:33:07.8826787Z torch.manual_seed(2025) 2025-05-07T20:33:07.8826864Z 2025-05-07T20:33:07.8827039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8827119Z 2025-05-07T20:33:07.8827222Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8827349Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8827451Z x = x_sign * x_clamp 2025-05-07T20:33:07.8827535Z x0 = x[:, :D] 2025-05-07T20:33:07.8827621Z x1 = x[:, D:] 2025-05-07T20:33:07.8827706Z 2025-05-07T20:33:07.8827794Z if contiguous: 2025-05-07T20:33:07.8827888Z x0 = x0.contiguous() 2025-05-07T20:33:07.8827990Z x1 = x1.contiguous() 2025-05-07T20:33:07.8828066Z 2025-05-07T20:33:07.8828166Z if scale_ub is not None: 2025-05-07T20:33:07.8828285Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8828424Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8828506Z ) 2025-05-07T20:33:07.8828591Z else: 2025-05-07T20:33:07.8828688Z scale_ub_tensor = None 2025-05-07T20:33:07.8828806Z 2025-05-07T20:33:07.8828951Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8829079Z op = silu_mul_quant 2025-05-07T20:33:07.8829174Z if compiled: 2025-05-07T20:33:07.8829275Z op = torch.compile(op) 2025-05-07T20:33:07.8829383Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8829465Z 2025-05-07T20:33:07.8829597Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8829601Z 2025-05-07T20:33:07.8829699Z moe/activation_test.py:117: 2025-05-07T20:33:07.8829959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8830066Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8830168Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8830575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8830667Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8831213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8831316Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8831700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8831939Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8832309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8832413Z kernel = self.compile( 2025-05-07T20:33:07.8832830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8833016Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8833158Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8833165Z 2025-05-07T20:33:07.8833382Z self = 2025-05-07T20:33:07.8834240Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8834789Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b85ef550>} 2025-05-07T20:33:07.8835682Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8835889Z context = 2025-05-07T20:33:07.8835893Z 2025-05-07T20:33:07.8836068Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8836356Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8836470Z module_map=module_map) 2025-05-07T20:33:07.8836637Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8836752Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8836833Z E ^ 2025-05-07T20:33:07.8837219Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8837230Z 2025-05-07T20:33:07.8837684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8837689Z 2025-05-07T20:33:07.8837795Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8838035Z self=, 2025-05-07T20:33:07.8838117Z T=4096, 2025-05-07T20:33:07.8838237Z D=7168, 2025-05-07T20:33:07.8838332Z scale_ub=None, 2025-05-07T20:33:07.8838457Z contiguous=False, 2025-05-07T20:33:07.8838541Z compiled=True, 2025-05-07T20:33:07.8838624Z ) 2025-05-07T20:33:07.8838852Z self = 2025-05-07T20:33:07.8839043Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.8839085Z 2025-05-07T20:33:07.8839165Z @given( 2025-05-07T20:33:07.8839286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8839392Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8839511Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8839630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8839752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8839827Z ) 2025-05-07T20:33:07.8840090Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8840192Z def test_silu_mul_quant( 2025-05-07T20:33:07.8840274Z self, 2025-05-07T20:33:07.8840356Z T: int, 2025-05-07T20:33:07.8840432Z D: int, 2025-05-07T20:33:07.8840531Z scale_ub: Optional[float], 2025-05-07T20:33:07.8840629Z contiguous: bool, 2025-05-07T20:33:07.8840713Z compiled: bool, 2025-05-07T20:33:07.8840791Z ) -> None: 2025-05-07T20:33:07.8840895Z torch.manual_seed(2025) 2025-05-07T20:33:07.8840974Z 2025-05-07T20:33:07.8841148Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8841228Z 2025-05-07T20:33:07.8841322Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8841449Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8841546Z x = x_sign * x_clamp 2025-05-07T20:33:07.8841628Z x0 = x[:, :D] 2025-05-07T20:33:07.8841716Z x1 = x[:, D:] 2025-05-07T20:33:07.8841791Z 2025-05-07T20:33:07.8841877Z if contiguous: 2025-05-07T20:33:07.8841975Z x0 = x0.contiguous() 2025-05-07T20:33:07.8842068Z x1 = x1.contiguous() 2025-05-07T20:33:07.8842143Z 2025-05-07T20:33:07.8842244Z if scale_ub is not None: 2025-05-07T20:33:07.8842352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8842491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8842576Z ) 2025-05-07T20:33:07.8842674Z else: 2025-05-07T20:33:07.8842779Z scale_ub_tensor = None 2025-05-07T20:33:07.8842882Z 2025-05-07T20:33:07.8843015Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8843161Z op = silu_mul_quant 2025-05-07T20:33:07.8843264Z if compiled: 2025-05-07T20:33:07.8843367Z op = torch.compile(op) 2025-05-07T20:33:07.8843479Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8843555Z 2025-05-07T20:33:07.8843646Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8843652Z 2025-05-07T20:33:07.8843757Z moe/activation_test.py:117: 2025-05-07T20:33:07.8843899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8843999Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8844110Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8844507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8844608Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8845154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8845254Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8845644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8845877Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8846289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8846425Z kernel = self.compile( 2025-05-07T20:33:07.8846834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8847023Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8847154Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8847198Z 2025-05-07T20:33:07.8847414Z self = 2025-05-07T20:33:07.8848271Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8848823Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8367160>} 2025-05-07T20:33:07.8849650Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8849849Z context = 2025-05-07T20:33:07.8849856Z 2025-05-07T20:33:07.8850032Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8850313Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8850427Z module_map=module_map) 2025-05-07T20:33:07.8850600Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8850702Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8850784Z E ^ 2025-05-07T20:33:07.8851186Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8851193Z 2025-05-07T20:33:07.8851643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8851648Z 2025-05-07T20:33:07.8851760Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8851996Z self=, 2025-05-07T20:33:07.8852080Z T=16384, 2025-05-07T20:33:07.8852165Z D=5120, 2025-05-07T20:33:07.8852251Z scale_ub=1200.0, 2025-05-07T20:33:07.8852343Z contiguous=False, 2025-05-07T20:33:07.8852477Z compiled=False, 2025-05-07T20:33:07.8852556Z ) 2025-05-07T20:33:07.8852785Z self = 2025-05-07T20:33:07.8852984Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.8852988Z 2025-05-07T20:33:07.8853073Z @given( 2025-05-07T20:33:07.8853199Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8853301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8853417Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8853541Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8853655Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8853730Z ) 2025-05-07T20:33:07.8854002Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8854096Z def test_silu_mul_quant( 2025-05-07T20:33:07.8854175Z self, 2025-05-07T20:33:07.8854263Z T: int, 2025-05-07T20:33:07.8854342Z D: int, 2025-05-07T20:33:07.8854450Z scale_ub: Optional[float], 2025-05-07T20:33:07.8854539Z contiguous: bool, 2025-05-07T20:33:07.8854627Z compiled: bool, 2025-05-07T20:33:07.8854713Z ) -> None: 2025-05-07T20:33:07.8854806Z torch.manual_seed(2025) 2025-05-07T20:33:07.8854922Z 2025-05-07T20:33:07.8855104Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8855217Z 2025-05-07T20:33:07.8855308Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8855439Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8855528Z x = x_sign * x_clamp 2025-05-07T20:33:07.8855616Z x0 = x[:, :D] 2025-05-07T20:33:07.8855744Z x1 = x[:, D:] 2025-05-07T20:33:07.8855818Z 2025-05-07T20:33:07.8855909Z if contiguous: 2025-05-07T20:33:07.8856000Z x0 = x0.contiguous() 2025-05-07T20:33:07.8856090Z x1 = x1.contiguous() 2025-05-07T20:33:07.8856170Z 2025-05-07T20:33:07.8856261Z if scale_ub is not None: 2025-05-07T20:33:07.8856368Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8856509Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8856586Z ) 2025-05-07T20:33:07.8856663Z else: 2025-05-07T20:33:07.8856770Z scale_ub_tensor = None 2025-05-07T20:33:07.8856858Z 2025-05-07T20:33:07.8861585Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8861702Z op = silu_mul_quant 2025-05-07T20:33:07.8861796Z if compiled: 2025-05-07T20:33:07.8861901Z op = torch.compile(op) 2025-05-07T20:33:07.8862018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8862100Z 2025-05-07T20:33:07.8862202Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8862207Z 2025-05-07T20:33:07.8862306Z moe/activation_test.py:117: 2025-05-07T20:33:07.8862448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8862563Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8862665Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8863225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:07.8863334Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8863729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8863975Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8864341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8864440Z kernel = self.compile( 2025-05-07T20:33:07.8864866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8865127Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8865265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8865277Z 2025-05-07T20:33:07.8865491Z self = 2025-05-07T20:33:07.8866350Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8866917Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8367940>} 2025-05-07T20:33:07.8867742Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8867948Z context = 2025-05-07T20:33:07.8867952Z 2025-05-07T20:33:07.8868124Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8868452Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8868579Z module_map=module_map) 2025-05-07T20:33:07.8868841Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8868952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8869032Z E ^ 2025-05-07T20:33:07.8869416Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8869492Z 2025-05-07T20:33:07.8870096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8870101Z 2025-05-07T20:33:07.8870212Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8870450Z self=, 2025-05-07T20:33:07.8870541Z T=16384, 2025-05-07T20:33:07.8870619Z D=5120, 2025-05-07T20:33:07.8870716Z scale_ub=1200.0, 2025-05-07T20:33:07.8870803Z contiguous=True, 2025-05-07T20:33:07.8870892Z compiled=True, 2025-05-07T20:33:07.8870981Z ) 2025-05-07T20:33:07.8871214Z self = 2025-05-07T20:33:07.8871396Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.8871400Z 2025-05-07T20:33:07.8871489Z @given( 2025-05-07T20:33:07.8871610Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8871716Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8871850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8871973Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8872101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8872182Z ) 2025-05-07T20:33:07.8872449Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8872555Z def test_silu_mul_quant( 2025-05-07T20:33:07.8872640Z self, 2025-05-07T20:33:07.8872727Z T: int, 2025-05-07T20:33:07.8872819Z D: int, 2025-05-07T20:33:07.8872927Z scale_ub: Optional[float], 2025-05-07T20:33:07.8873019Z contiguous: bool, 2025-05-07T20:33:07.8873117Z compiled: bool, 2025-05-07T20:33:07.8873203Z ) -> None: 2025-05-07T20:33:07.8873305Z torch.manual_seed(2025) 2025-05-07T20:33:07.8873395Z 2025-05-07T20:33:07.8873575Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8873664Z 2025-05-07T20:33:07.8873765Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8873895Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8874044Z x = x_sign * x_clamp 2025-05-07T20:33:07.8874132Z x0 = x[:, :D] 2025-05-07T20:33:07.8874213Z x1 = x[:, D:] 2025-05-07T20:33:07.8874297Z 2025-05-07T20:33:07.8874382Z if contiguous: 2025-05-07T20:33:07.8874477Z x0 = x0.contiguous() 2025-05-07T20:33:07.8874574Z x1 = x1.contiguous() 2025-05-07T20:33:07.8874650Z 2025-05-07T20:33:07.8874745Z if scale_ub is not None: 2025-05-07T20:33:07.8874863Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8875007Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8875095Z ) 2025-05-07T20:33:07.8875172Z else: 2025-05-07T20:33:07.8875283Z scale_ub_tensor = None 2025-05-07T20:33:07.8875361Z 2025-05-07T20:33:07.8875501Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8875601Z op = silu_mul_quant 2025-05-07T20:33:07.8875689Z if compiled: 2025-05-07T20:33:07.8875797Z op = torch.compile(op) 2025-05-07T20:33:07.8875920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8875998Z 2025-05-07T20:33:07.8876092Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8876097Z 2025-05-07T20:33:07.8876202Z moe/activation_test.py:117: 2025-05-07T20:33:07.8876387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8876539Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8876647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8877042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8877143Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8877682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8877826Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8878218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8878453Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8878824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8878922Z kernel = self.compile( 2025-05-07T20:33:07.8879332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8879528Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8879663Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8879668Z 2025-05-07T20:33:07.8879889Z self = 2025-05-07T20:33:07.8880748Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8881302Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b81d7550>} 2025-05-07T20:33:07.8882129Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8882332Z context = 2025-05-07T20:33:07.8882337Z 2025-05-07T20:33:07.8882517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8883114Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8883265Z module_map=module_map) 2025-05-07T20:33:07.8883634Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8883739Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8883819Z E ^ 2025-05-07T20:33:07.8884211Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8884216Z 2025-05-07T20:33:07.8884673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8884680Z 2025-05-07T20:33:07.8884791Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8885023Z self=, 2025-05-07T20:33:07.8885103Z T=16384, 2025-05-07T20:33:07.8885189Z D=5120, 2025-05-07T20:33:07.8885278Z scale_ub=None, 2025-05-07T20:33:07.8885370Z contiguous=False, 2025-05-07T20:33:07.8885467Z compiled=True, 2025-05-07T20:33:07.8885545Z ) 2025-05-07T20:33:07.8885787Z self = 2025-05-07T20:33:07.8885977Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.8885981Z 2025-05-07T20:33:07.8886063Z @given( 2025-05-07T20:33:07.8886192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8886363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8886486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8886671Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8886786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8886860Z ) 2025-05-07T20:33:07.8887127Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8887225Z def test_silu_mul_quant( 2025-05-07T20:33:07.8887366Z self, 2025-05-07T20:33:07.8887443Z T: int, 2025-05-07T20:33:07.8887522Z D: int, 2025-05-07T20:33:07.8887627Z scale_ub: Optional[float], 2025-05-07T20:33:07.8887718Z contiguous: bool, 2025-05-07T20:33:07.8887803Z compiled: bool, 2025-05-07T20:33:07.8887892Z ) -> None: 2025-05-07T20:33:07.8887991Z torch.manual_seed(2025) 2025-05-07T20:33:07.8888069Z 2025-05-07T20:33:07.8888252Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8888333Z 2025-05-07T20:33:07.8888429Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8888566Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8888660Z x = x_sign * x_clamp 2025-05-07T20:33:07.8888752Z x0 = x[:, :D] 2025-05-07T20:33:07.8888835Z x1 = x[:, D:] 2025-05-07T20:33:07.8888912Z 2025-05-07T20:33:07.8889011Z if contiguous: 2025-05-07T20:33:07.8889108Z x0 = x0.contiguous() 2025-05-07T20:33:07.8889205Z x1 = x1.contiguous() 2025-05-07T20:33:07.8889289Z 2025-05-07T20:33:07.8889384Z if scale_ub is not None: 2025-05-07T20:33:07.8889497Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8889644Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8889725Z ) 2025-05-07T20:33:07.8889809Z else: 2025-05-07T20:33:07.8889912Z scale_ub_tensor = None 2025-05-07T20:33:07.8889990Z 2025-05-07T20:33:07.8890129Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8890230Z op = silu_mul_quant 2025-05-07T20:33:07.8890323Z if compiled: 2025-05-07T20:33:07.8890436Z op = torch.compile(op) 2025-05-07T20:33:07.8890547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8890628Z 2025-05-07T20:33:07.8890733Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8890737Z 2025-05-07T20:33:07.8890840Z moe/activation_test.py:117: 2025-05-07T20:33:07.8890977Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8891090Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8891241Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8891648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8891743Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8892288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8892397Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8892783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8893017Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8893393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8893492Z kernel = self.compile( 2025-05-07T20:33:07.8893917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8894102Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8894237Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8894242Z 2025-05-07T20:33:07.8894508Z self = 2025-05-07T20:33:07.8895359Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8895953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b83471f0>} 2025-05-07T20:33:07.8896807Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8897004Z context = 2025-05-07T20:33:07.8897014Z 2025-05-07T20:33:07.8897185Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8897466Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8897587Z module_map=module_map) 2025-05-07T20:33:07.8897754Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8897855Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8897945Z E ^ 2025-05-07T20:33:07.8898334Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8898342Z 2025-05-07T20:33:07.8898800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8898805Z 2025-05-07T20:33:07.8898910Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8899149Z self=, 2025-05-07T20:33:07.8899240Z T=2048, 2025-05-07T20:33:07.8899324Z D=5120, 2025-05-07T20:33:07.8899412Z scale_ub=None, 2025-05-07T20:33:07.8899511Z contiguous=False, 2025-05-07T20:33:07.8899598Z compiled=True, 2025-05-07T20:33:07.8899679Z ) 2025-05-07T20:33:07.8899918Z self = 2025-05-07T20:33:07.8900104Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.8900109Z 2025-05-07T20:33:07.8900196Z @given( 2025-05-07T20:33:07.8900322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8900426Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8900551Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8900717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8900835Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8900916Z ) 2025-05-07T20:33:07.8901176Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8901271Z def test_silu_mul_quant( 2025-05-07T20:33:07.8901355Z self, 2025-05-07T20:33:07.8901433Z T: int, 2025-05-07T20:33:07.8901519Z D: int, 2025-05-07T20:33:07.8901617Z scale_ub: Optional[float], 2025-05-07T20:33:07.8901706Z contiguous: bool, 2025-05-07T20:33:07.8901796Z compiled: bool, 2025-05-07T20:33:07.8901876Z ) -> None: 2025-05-07T20:33:07.8901976Z torch.manual_seed(2025) 2025-05-07T20:33:07.8902060Z 2025-05-07T20:33:07.8902240Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8902318Z 2025-05-07T20:33:07.8902419Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8902550Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8902644Z x = x_sign * x_clamp 2025-05-07T20:33:07.8902739Z x0 = x[:, :D] 2025-05-07T20:33:07.8902825Z x1 = x[:, D:] 2025-05-07T20:33:07.8902907Z 2025-05-07T20:33:07.8902995Z if contiguous: 2025-05-07T20:33:07.8903089Z x0 = x0.contiguous() 2025-05-07T20:33:07.8903232Z x1 = x1.contiguous() 2025-05-07T20:33:07.8903371Z 2025-05-07T20:33:07.8903464Z if scale_ub is not None: 2025-05-07T20:33:07.8903578Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8903715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8903793Z ) 2025-05-07T20:33:07.8903876Z else: 2025-05-07T20:33:07.8903972Z scale_ub_tensor = None 2025-05-07T20:33:07.8904151Z 2025-05-07T20:33:07.8904290Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8904383Z op = silu_mul_quant 2025-05-07T20:33:07.8904473Z if compiled: 2025-05-07T20:33:07.8904581Z op = torch.compile(op) 2025-05-07T20:33:07.8904687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8904769Z 2025-05-07T20:33:07.8904860Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8904864Z 2025-05-07T20:33:07.8904963Z moe/activation_test.py:117: 2025-05-07T20:33:07.8905107Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8905213Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8905315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8905716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8905811Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8906359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8906459Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8906843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8907083Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8907455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8907552Z kernel = self.compile( 2025-05-07T20:33:07.8907980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8908164Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8908305Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8908313Z 2025-05-07T20:33:07.8908531Z self = 2025-05-07T20:33:07.8909433Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8910148Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8347f70>} 2025-05-07T20:33:07.8910964Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8911172Z context = 2025-05-07T20:33:07.8911177Z 2025-05-07T20:33:07.8911349Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8911640Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8911753Z module_map=module_map) 2025-05-07T20:33:07.8911919Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8912026Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8912104Z E ^ 2025-05-07T20:33:07.8912530Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8912535Z 2025-05-07T20:33:07.8913026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8913030Z 2025-05-07T20:33:07.8913134Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8913372Z self=, 2025-05-07T20:33:07.8913451Z T=2048, 2025-05-07T20:33:07.8913571Z D=5120, 2025-05-07T20:33:07.8913661Z scale_ub=1200.0, 2025-05-07T20:33:07.8913747Z contiguous=False, 2025-05-07T20:33:07.8913832Z compiled=True, 2025-05-07T20:33:07.8913909Z ) 2025-05-07T20:33:07.8914139Z self = 2025-05-07T20:33:07.8914321Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:07.8914335Z 2025-05-07T20:33:07.8914419Z @given( 2025-05-07T20:33:07.8914546Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8914654Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8914777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8914898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8915021Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8915104Z ) 2025-05-07T20:33:07.8915369Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8915479Z def test_silu_mul_quant( 2025-05-07T20:33:07.8915559Z self, 2025-05-07T20:33:07.8915639Z T: int, 2025-05-07T20:33:07.8915728Z D: int, 2025-05-07T20:33:07.8915833Z scale_ub: Optional[float], 2025-05-07T20:33:07.8915930Z contiguous: bool, 2025-05-07T20:33:07.8916019Z compiled: bool, 2025-05-07T20:33:07.8916099Z ) -> None: 2025-05-07T20:33:07.8916201Z torch.manual_seed(2025) 2025-05-07T20:33:07.8916279Z 2025-05-07T20:33:07.8916458Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8916541Z 2025-05-07T20:33:07.8916642Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8916771Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8916868Z x = x_sign * x_clamp 2025-05-07T20:33:07.8916953Z x0 = x[:, :D] 2025-05-07T20:33:07.8917037Z x1 = x[:, D:] 2025-05-07T20:33:07.8917120Z 2025-05-07T20:33:07.8917217Z if contiguous: 2025-05-07T20:33:07.8917315Z x0 = x0.contiguous() 2025-05-07T20:33:07.8917415Z x1 = x1.contiguous() 2025-05-07T20:33:07.8917492Z 2025-05-07T20:33:07.8917639Z if scale_ub is not None: 2025-05-07T20:33:07.8917751Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8917889Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8917973Z ) 2025-05-07T20:33:07.8918050Z else: 2025-05-07T20:33:07.8918146Z scale_ub_tensor = None 2025-05-07T20:33:07.8918230Z 2025-05-07T20:33:07.8918361Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8918453Z op = silu_mul_quant 2025-05-07T20:33:07.8918545Z if compiled: 2025-05-07T20:33:07.8918646Z op = torch.compile(op) 2025-05-07T20:33:07.8918752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8918836Z 2025-05-07T20:33:07.8918932Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8918939Z 2025-05-07T20:33:07.8919045Z moe/activation_test.py:117: 2025-05-07T20:33:07.8919179Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8919286Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8919398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8919797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8919895Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8920485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8920619Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8921011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8921245Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8921647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8921748Z kernel = self.compile( 2025-05-07T20:33:07.8922162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8922348Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8922480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8922485Z 2025-05-07T20:33:07.8922701Z self = 2025-05-07T20:33:07.8923559Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8924108Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8195940>} 2025-05-07T20:33:07.8924934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8925131Z context = 2025-05-07T20:33:07.8925136Z 2025-05-07T20:33:07.8925311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8925601Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8925714Z module_map=module_map) 2025-05-07T20:33:07.8925886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8925988Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8926070Z E ^ 2025-05-07T20:33:07.8926463Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8926471Z 2025-05-07T20:33:07.8926966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8926971Z 2025-05-07T20:33:07.8927087Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8927320Z self=, 2025-05-07T20:33:07.8927398Z T=4096, 2025-05-07T20:33:07.8927484Z D=5120, 2025-05-07T20:33:07.8927572Z scale_ub=1200.0, 2025-05-07T20:33:07.8927660Z contiguous=True, 2025-05-07T20:33:07.8927755Z compiled=True, 2025-05-07T20:33:07.8927831Z ) 2025-05-07T20:33:07.8928059Z self = 2025-05-07T20:33:07.8928243Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.8928248Z 2025-05-07T20:33:07.8928344Z @given( 2025-05-07T20:33:07.8928465Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8928574Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8928695Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8928814Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8928934Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8929011Z ) 2025-05-07T20:33:07.8929268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8929418Z def test_silu_mul_quant( 2025-05-07T20:33:07.8929502Z self, 2025-05-07T20:33:07.8929623Z T: int, 2025-05-07T20:33:07.8929712Z D: int, 2025-05-07T20:33:07.8929816Z scale_ub: Optional[float], 2025-05-07T20:33:07.8929913Z contiguous: bool, 2025-05-07T20:33:07.8930004Z compiled: bool, 2025-05-07T20:33:07.8930087Z ) -> None: 2025-05-07T20:33:07.8930192Z torch.manual_seed(2025) 2025-05-07T20:33:07.8930306Z 2025-05-07T20:33:07.8930480Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8930562Z 2025-05-07T20:33:07.8930653Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8930784Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8930881Z x = x_sign * x_clamp 2025-05-07T20:33:07.8930960Z x0 = x[:, :D] 2025-05-07T20:33:07.8931043Z x1 = x[:, D:] 2025-05-07T20:33:07.8931125Z 2025-05-07T20:33:07.8931208Z if contiguous: 2025-05-07T20:33:07.8931304Z x0 = x0.contiguous() 2025-05-07T20:33:07.8931403Z x1 = x1.contiguous() 2025-05-07T20:33:07.8931483Z 2025-05-07T20:33:07.8931581Z if scale_ub is not None: 2025-05-07T20:33:07.8931688Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8931824Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8931905Z ) 2025-05-07T20:33:07.8931983Z else: 2025-05-07T20:33:07.8932082Z scale_ub_tensor = None 2025-05-07T20:33:07.8932166Z 2025-05-07T20:33:07.8932299Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8932396Z op = silu_mul_quant 2025-05-07T20:33:07.8932487Z if compiled: 2025-05-07T20:33:07.8932589Z op = torch.compile(op) 2025-05-07T20:33:07.8932695Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8932774Z 2025-05-07T20:33:07.8932866Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8932870Z 2025-05-07T20:33:07.8932981Z moe/activation_test.py:117: 2025-05-07T20:33:07.8933120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8933225Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8933336Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8933734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8933832Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8934381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8934528Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8934924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8935158Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8935523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8935626Z kernel = self.compile( 2025-05-07T20:33:07.8936037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8936225Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8936357Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8936364Z 2025-05-07T20:33:07.8936579Z self = 2025-05-07T20:33:07.8937442Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8938066Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b80f7790>} 2025-05-07T20:33:07.8938924Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8939124Z context = 2025-05-07T20:33:07.8939165Z 2025-05-07T20:33:07.8939340Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8939625Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8939737Z module_map=module_map) 2025-05-07T20:33:07.8939908Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8940007Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8940084Z E ^ 2025-05-07T20:33:07.8940476Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8940483Z 2025-05-07T20:33:07.8940931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8940936Z 2025-05-07T20:33:07.8941049Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8941282Z self=, 2025-05-07T20:33:07.8941364Z T=128, 2025-05-07T20:33:07.8941451Z D=5120, 2025-05-07T20:33:07.8941534Z scale_ub=1200.0, 2025-05-07T20:33:07.8941622Z contiguous=False, 2025-05-07T20:33:07.8941714Z compiled=True, 2025-05-07T20:33:07.8941788Z ) 2025-05-07T20:33:07.8942016Z self = 2025-05-07T20:33:07.8942200Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:07.8942205Z 2025-05-07T20:33:07.8942284Z @given( 2025-05-07T20:33:07.8942405Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8942515Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8942633Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8942756Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8942870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8942945Z ) 2025-05-07T20:33:07.8943208Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8943310Z def test_silu_mul_quant( 2025-05-07T20:33:07.8943391Z self, 2025-05-07T20:33:07.8943479Z T: int, 2025-05-07T20:33:07.8943607Z D: int, 2025-05-07T20:33:07.8943711Z scale_ub: Optional[float], 2025-05-07T20:33:07.8943810Z contiguous: bool, 2025-05-07T20:33:07.8943899Z compiled: bool, 2025-05-07T20:33:07.8943987Z ) -> None: 2025-05-07T20:33:07.8944088Z torch.manual_seed(2025) 2025-05-07T20:33:07.8944165Z 2025-05-07T20:33:07.8944350Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8944432Z 2025-05-07T20:33:07.8944530Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8944666Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8944760Z x = x_sign * x_clamp 2025-05-07T20:33:07.8944844Z x0 = x[:, :D] 2025-05-07T20:33:07.8944934Z x1 = x[:, D:] 2025-05-07T20:33:07.8945013Z 2025-05-07T20:33:07.8945103Z if contiguous: 2025-05-07T20:33:07.8945204Z x0 = x0.contiguous() 2025-05-07T20:33:07.8945298Z x1 = x1.contiguous() 2025-05-07T20:33:07.8945378Z 2025-05-07T20:33:07.8945481Z if scale_ub is not None: 2025-05-07T20:33:07.8945590Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8945736Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8945814Z ) 2025-05-07T20:33:07.8945892Z else: 2025-05-07T20:33:07.8946039Z scale_ub_tensor = None 2025-05-07T20:33:07.8946117Z 2025-05-07T20:33:07.8946286Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8946384Z op = silu_mul_quant 2025-05-07T20:33:07.8946471Z if compiled: 2025-05-07T20:33:07.8946570Z op = torch.compile(op) 2025-05-07T20:33:07.8946684Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8946754Z 2025-05-07T20:33:07.8946888Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8946900Z 2025-05-07T20:33:07.8947000Z moe/activation_test.py:117: 2025-05-07T20:33:07.8947136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8947243Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8947343Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8947737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8947842Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8948380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8948481Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8948870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8949103Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8949481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8949576Z kernel = self.compile( 2025-05-07T20:33:07.8950130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8950318Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8950454Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8950458Z 2025-05-07T20:33:07.8950682Z self = 2025-05-07T20:33:07.8951532Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8952084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b80690d0>} 2025-05-07T20:33:07.8952951Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8953150Z context = 2025-05-07T20:33:07.8953155Z 2025-05-07T20:33:07.8953334Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8953615Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8953720Z module_map=module_map) 2025-05-07T20:33:07.8953894Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8953992Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8954081Z E ^ 2025-05-07T20:33:07.8954463Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

[The identical CompilationError (ValueError: type fp8e4nv not supported in this architecture; the supported fp8 dtypes are 'fp8e4b15' and 'fp8e5'), with the same traceback through _fbgemm_silu_mul_quant, was also raised for the following examples:

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)]

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = 
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    [test body as above]

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
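[Note on the CompilationError: this job runs on a linux.g5.4xlarge runner, whose NVIDIA A10G GPU reports compute capability 8.6 (sm_86). Triton's fp8e4nv type, the e4m3 format that _fbgemm_silu_mul_quant quantizes to (torch.float8_e4m3fn on the PyTorch side), is only lowered on compute capability 8.9 and newer (Ada/Hopper), which is why the compiler offers only fp8e4b15 and fp8e5 here. A minimal sketch of a capability guard that would skip these examples on pre-sm_89 runners; supports_fp8e4nv() and the decorator placement are assumptions for illustration, not part of activation_test.py:

# Sketch only: gate fp8e4nv-dependent tests on GPU compute capability.
# supports_fp8e4nv() is a hypothetical helper, not an FBGEMM API.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """True if the current CUDA device can compile Triton fp8e4nv kernels."""
    if not torch.cuda.is_available():
        return False
    # fp8e4nv (torch.float8_e4m3fn) needs compute capability >= 8.9
    # (Ada/Hopper); the A10G on g5 runners reports (8, 6).
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class SiluMulQuantFp8Tests(unittest.TestCase):
    ...  # test_silu_mul_quant would live here

With a guard like this, the run would report one skip instead of compiling and failing every Hypothesis example with the same error.]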
[The same torch.OutOfMemoryError (GPU 0 has a total capacity of 22.07 GiB, almost all of it still held by allocations from earlier examples) was also raised for the following examples:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): tried to allocate 112.00 MiB, raised at moe/activation_test.py:95 (x_clamp)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False): tried to allocate 448.00 MiB, raised at moe/activation_test.py:92 (torch.randn)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): tried to allocate 56.00 MiB, raised at moe/activation_test.py:95 (x_clamp)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False): tried to allocate 56.00 MiB, raised at moe/activation_test.py:94 (x_sign)]
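[Note on the OutOfMemoryError cascade: these look like secondary failures. Each Hypothesis example allocates a fresh [T, 2 * D] bfloat16 tensor plus intermediates on the same device, and once roughly 21.6 of the 22.07 GiB are held, even a 56 MiB request fails. A sketch of an explicit per-example cleanup; release_cuda_memory() and where it is called are assumptions for illustration, not existing code in activation_test.py:

# Sketch only: release CUDA memory between Hypothesis examples so one
# large or failed example does not starve the next. Hypothetical helper.
import gc

import torch


def release_cuda_memory() -> None:
    # Collect dead Python references first so their CUDA tensors become
    # unreachable, then return the allocator's cached blocks to the driver.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

Calling release_cuda_memory() at the top of the test body (or from a setUp hook) would bound the accumulation across examples. The error text's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, targets fragmentation rather than retained allocations, so it would likely help less here.]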
[The identical fp8e4nv CompilationError also recurred for:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)]

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self = 
T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False

    [test body and Triton traceback as above]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.9116069Z 2025-05-07T20:33:07.9116522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.9116527Z 2025-05-07T20:33:07.9116631Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9116868Z self=, 2025-05-07T20:33:07.9116956Z T=2048, 2025-05-07T20:33:07.9117035Z D=7168, 2025-05-07T20:33:07.9117119Z scale_ub=1200.0, 2025-05-07T20:33:07.9117213Z contiguous=True, 2025-05-07T20:33:07.9117299Z compiled=False, 2025-05-07T20:33:07.9117375Z ) 2025-05-07T20:33:07.9117610Z self = 2025-05-07T20:33:07.9117798Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.9117802Z 2025-05-07T20:33:07.9117894Z @given( 2025-05-07T20:33:07.9118020Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9118127Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9118254Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9118377Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9118493Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9118580Z ) 2025-05-07T20:33:07.9118882Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9119013Z def test_silu_mul_quant( 2025-05-07T20:33:07.9119096Z self, 2025-05-07T20:33:07.9119174Z T: int, 2025-05-07T20:33:07.9119256Z D: int, 2025-05-07T20:33:07.9119356Z scale_ub: Optional[float], 2025-05-07T20:33:07.9119446Z contiguous: bool, 2025-05-07T20:33:07.9119575Z compiled: bool, 2025-05-07T20:33:07.9119654Z ) -> None: 2025-05-07T20:33:07.9119749Z torch.manual_seed(2025) 2025-05-07T20:33:07.9119826Z 2025-05-07T20:33:07.9120003Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9121976Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
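The allocator hint in this message can be tried from the test environment; a minimal sketch, assuming the variable is set before the process first touches CUDA, plus a hypothetical helper to drop cached blocks between Hypothesis examples:

import os

# Must be set before the first CUDA allocation to take effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_cached_cuda_memory() -> None:
    # Hypothetical per-example teardown: return cached blocks to the
    # driver so each [T, 2 * D] input starts from a less fragmented pool.
    torch.cuda.synchronize()
    torch.cuda.empty_cache()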
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9121990Z 2025-05-07T20:33:07.9122107Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9122111Z 2025-05-07T20:33:07.9122218Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9122455Z self=, 2025-05-07T20:33:07.9122534Z T=1, 2025-05-07T20:33:07.9122620Z D=5120, 2025-05-07T20:33:07.9122712Z scale_ub=1200.0, 2025-05-07T20:33:07.9122815Z contiguous=True, 2025-05-07T20:33:07.9122917Z compiled=False, 2025-05-07T20:33:07.9123003Z ) 2025-05-07T20:33:07.9123231Z self = 2025-05-07T20:33:07.9123414Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.9123419Z 2025-05-07T20:33:07.9123503Z @given( 2025-05-07T20:33:07.9123625Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9123735Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9123854Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9123980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9124104Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9124183Z ) 2025-05-07T20:33:07.9124451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9124590Z def test_silu_mul_quant( 2025-05-07T20:33:07.9124670Z self, 2025-05-07T20:33:07.9124754Z T: int, 2025-05-07T20:33:07.9124833Z D: int, 2025-05-07T20:33:07.9124931Z scale_ub: Optional[float], 2025-05-07T20:33:07.9125027Z contiguous: bool, 2025-05-07T20:33:07.9125111Z compiled: bool, 2025-05-07T20:33:07.9125192Z ) -> None: 2025-05-07T20:33:07.9125299Z torch.manual_seed(2025) 2025-05-07T20:33:07.9125379Z 2025-05-07T20:33:07.9125554Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9125639Z 2025-05-07T20:33:07.9125736Z x_sign = torch.sign(x) 2025-05-07T20:33:07.9125873Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.9125969Z x = x_sign * x_clamp 2025-05-07T20:33:07.9126053Z x0 = x[:, :D] 2025-05-07T20:33:07.9126143Z x1 = x[:, D:] 2025-05-07T20:33:07.9126217Z 2025-05-07T20:33:07.9126307Z if contiguous: 2025-05-07T20:33:07.9126412Z x0 = x0.contiguous() 2025-05-07T20:33:07.9126505Z x1 = x1.contiguous() 2025-05-07T20:33:07.9126596Z 2025-05-07T20:33:07.9131206Z if scale_ub is not None: 2025-05-07T20:33:07.9131329Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.9131551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.9131640Z ) 2025-05-07T20:33:07.9131761Z else: 2025-05-07T20:33:07.9131866Z scale_ub_tensor = None 2025-05-07T20:33:07.9131939Z 2025-05-07T20:33:07.9132077Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.9132178Z op = silu_mul_quant 2025-05-07T20:33:07.9132266Z if compiled: 2025-05-07T20:33:07.9132368Z op = torch.compile(op) 2025-05-07T20:33:07.9132527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.9132598Z 2025-05-07T20:33:07.9132691Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.9132696Z 2025-05-07T20:33:07.9132811Z moe/activation_test.py:117: 2025-05-07T20:33:07.9132952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.9133067Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.9133172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.9133738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.9133852Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.9134247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.9134486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.9134869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.9134968Z kernel = self.compile( 2025-05-07T20:33:07.9135399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.9135583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.9135718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.9135723Z 2025-05-07T20:33:07.9135948Z self = 2025-05-07T20:33:07.9136804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.9137364Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b7ca3040>} 2025-05-07T20:33:07.9138231Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.9138434Z context = 2025-05-07T20:33:07.9138439Z 2025-05-07T20:33:07.9138621Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.9138902Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.9139026Z module_map=module_map) 2025-05-07T20:33:07.9139194Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.9139296Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.9139384Z E ^ 2025-05-07T20:33:07.9139771Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.9139779Z 2025-05-07T20:33:07.9140241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.9140246Z 2025-05-07T20:33:07.9140351Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9140585Z self=, 2025-05-07T20:33:07.9140677Z T=2048, 2025-05-07T20:33:07.9140758Z D=5120, 2025-05-07T20:33:07.9140885Z scale_ub=None, 2025-05-07T20:33:07.9140983Z contiguous=True, 2025-05-07T20:33:07.9141110Z compiled=False, 2025-05-07T20:33:07.9141186Z ) 2025-05-07T20:33:07.9141428Z self = 2025-05-07T20:33:07.9141611Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.9141615Z 2025-05-07T20:33:07.9141699Z @given( 2025-05-07T20:33:07.9141888Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9141988Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9142115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9142234Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9142349Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9142436Z ) 2025-05-07T20:33:07.9142697Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9142798Z def test_silu_mul_quant( 2025-05-07T20:33:07.9142887Z self, 2025-05-07T20:33:07.9142974Z T: int, 2025-05-07T20:33:07.9143062Z D: int, 2025-05-07T20:33:07.9143165Z scale_ub: Optional[float], 2025-05-07T20:33:07.9143259Z contiguous: bool, 2025-05-07T20:33:07.9143357Z compiled: bool, 2025-05-07T20:33:07.9143444Z ) -> None: 2025-05-07T20:33:07.9143545Z torch.manual_seed(2025) 2025-05-07T20:33:07.9143632Z 2025-05-07T20:33:07.9143810Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9143890Z 2025-05-07T20:33:07.9143993Z > x_sign = torch.sign(x) 2025-05-07T20:33:07.9145976Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9145985Z 2025-05-07T20:33:07.9146110Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:07.9146115Z 2025-05-07T20:33:07.9146220Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9146466Z self=, 2025-05-07T20:33:07.9146546Z T=16384, 2025-05-07T20:33:07.9146623Z D=5120, 2025-05-07T20:33:07.9146758Z scale_ub=None, 2025-05-07T20:33:07.9146845Z contiguous=True, 2025-05-07T20:33:07.9146938Z compiled=False, 2025-05-07T20:33:07.9147013Z ) 2025-05-07T20:33:07.9147240Z self = 2025-05-07T20:33:07.9147427Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.9147433Z 2025-05-07T20:33:07.9147510Z @given( 2025-05-07T20:33:07.9147632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9147743Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9147862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9147987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9148103Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9148183Z ) 2025-05-07T20:33:07.9148451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9148547Z def test_silu_mul_quant( 2025-05-07T20:33:07.9148625Z self, 2025-05-07T20:33:07.9148707Z T: int, 2025-05-07T20:33:07.9148784Z D: int, 2025-05-07T20:33:07.9148884Z scale_ub: Optional[float], 2025-05-07T20:33:07.9148978Z contiguous: bool, 2025-05-07T20:33:07.9149065Z compiled: bool, 2025-05-07T20:33:07.9149144Z ) -> None: 2025-05-07T20:33:07.9149293Z torch.manual_seed(2025) 2025-05-07T20:33:07.9149401Z 2025-05-07T20:33:07.9149573Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9151685Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9151733Z 2025-05-07T20:33:07.9151860Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9151864Z 2025-05-07T20:33:07.9151968Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9152201Z self=, 2025-05-07T20:33:07.9152288Z T=4096, 2025-05-07T20:33:07.9152366Z D=5120, 2025-05-07T20:33:07.9152447Z scale_ub=None, 2025-05-07T20:33:07.9152538Z contiguous=True, 2025-05-07T20:33:07.9152624Z compiled=False, 2025-05-07T20:33:07.9152699Z ) 2025-05-07T20:33:07.9152937Z self = 2025-05-07T20:33:07.9153120Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.9153124Z 2025-05-07T20:33:07.9153206Z @given( 2025-05-07T20:33:07.9153327Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9153426Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9153548Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9153666Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9153779Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9153864Z ) 2025-05-07T20:33:07.9154124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9154221Z def test_silu_mul_quant( 2025-05-07T20:33:07.9154305Z self, 2025-05-07T20:33:07.9154383Z T: int, 2025-05-07T20:33:07.9154466Z D: int, 2025-05-07T20:33:07.9154568Z scale_ub: Optional[float], 2025-05-07T20:33:07.9154657Z contiguous: bool, 2025-05-07T20:33:07.9154749Z compiled: bool, 2025-05-07T20:33:07.9154830Z ) -> None: 2025-05-07T20:33:07.9154923Z torch.manual_seed(2025) 2025-05-07T20:33:07.9155003Z 2025-05-07T20:33:07.9155218Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9157174Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9157189Z 2025-05-07T20:33:07.9157306Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9157312Z 2025-05-07T20:33:07.9157415Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9157656Z self=, 2025-05-07T20:33:07.9157737Z T=2048, 2025-05-07T20:33:07.9157831Z D=5120, 2025-05-07T20:33:07.9157918Z scale_ub=None, 2025-05-07T20:33:07.9158008Z contiguous=False, 2025-05-07T20:33:07.9158102Z compiled=False, 2025-05-07T20:33:07.9158179Z ) 2025-05-07T20:33:07.9158408Z self = 2025-05-07T20:33:07.9158637Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.9158676Z 2025-05-07T20:33:07.9158755Z @given( 2025-05-07T20:33:07.9158875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9158980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9159095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9159220Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9159374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9159449Z ) 2025-05-07T20:33:07.9159722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9159821Z def test_silu_mul_quant( 2025-05-07T20:33:07.9159902Z self, 2025-05-07T20:33:07.9159990Z T: int, 2025-05-07T20:33:07.9160069Z D: int, 2025-05-07T20:33:07.9160171Z scale_ub: Optional[float], 2025-05-07T20:33:07.9160271Z contiguous: bool, 2025-05-07T20:33:07.9160363Z compiled: bool, 2025-05-07T20:33:07.9160443Z ) -> None: 2025-05-07T20:33:07.9160557Z torch.manual_seed(2025) 2025-05-07T20:33:07.9160635Z 2025-05-07T20:33:07.9160817Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9162774Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9162784Z 2025-05-07T20:33:07.9162927Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9162931Z 2025-05-07T20:33:07.9163059Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9163295Z self=, 2025-05-07T20:33:07.9163380Z T=4096, 2025-05-07T20:33:07.9163457Z D=7168, 2025-05-07T20:33:07.9163548Z scale_ub=None, 2025-05-07T20:33:07.9163639Z contiguous=True, 2025-05-07T20:33:07.9163725Z compiled=True, 2025-05-07T20:33:07.9163800Z ) 2025-05-07T20:33:07.9164041Z self = 2025-05-07T20:33:07.9164216Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.9164221Z 2025-05-07T20:33:07.9164345Z @given( 2025-05-07T20:33:07.9164464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9164563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9164684Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9164801Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9164918Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9165003Z ) 2025-05-07T20:33:07.9165262Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9165357Z def test_silu_mul_quant( 2025-05-07T20:33:07.9165441Z self, 2025-05-07T20:33:07.9165519Z T: int, 2025-05-07T20:33:07.9165601Z D: int, 2025-05-07T20:33:07.9165698Z scale_ub: Optional[float], 2025-05-07T20:33:07.9165789Z contiguous: bool, 2025-05-07T20:33:07.9165879Z compiled: bool, 2025-05-07T20:33:07.9165958Z ) -> None: 2025-05-07T20:33:07.9166054Z torch.manual_seed(2025) 2025-05-07T20:33:07.9166135Z 2025-05-07T20:33:07.9166308Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9168303Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9168366Z 2025-05-07T20:33:07.9168523Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9168528Z 2025-05-07T20:33:07.9168631Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9168872Z self=, 2025-05-07T20:33:07.9168956Z T=2048, 2025-05-07T20:33:07.9169043Z D=5120, 2025-05-07T20:33:07.9169129Z scale_ub=1200.0, 2025-05-07T20:33:07.9169222Z contiguous=False, 2025-05-07T20:33:07.9169317Z compiled=False, 2025-05-07T20:33:07.9169395Z ) 2025-05-07T20:33:07.9169627Z self = 2025-05-07T20:33:07.9169822Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.9169827Z 2025-05-07T20:33:07.9169907Z @given( 2025-05-07T20:33:07.9170030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9170138Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9170256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9170386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9170507Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9170586Z ) 2025-05-07T20:33:07.9170855Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9170953Z def test_silu_mul_quant( 2025-05-07T20:33:07.9171033Z self, 2025-05-07T20:33:07.9171120Z T: int, 2025-05-07T20:33:07.9171199Z D: int, 2025-05-07T20:33:07.9171301Z scale_ub: Optional[float], 2025-05-07T20:33:07.9171402Z contiguous: bool, 2025-05-07T20:33:07.9171494Z compiled: bool, 2025-05-07T20:33:07.9171575Z ) -> None: 2025-05-07T20:33:07.9171680Z torch.manual_seed(2025) 2025-05-07T20:33:07.9171757Z 2025-05-07T20:33:07.9171941Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9173982Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9173991Z 2025-05-07T20:33:07.9174118Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9174122Z 2025-05-07T20:33:07.9174226Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9174457Z self=, 2025-05-07T20:33:07.9174536Z T=4096, 2025-05-07T20:33:07.9174612Z D=7168, 2025-05-07T20:33:07.9174695Z scale_ub=1200.0, 2025-05-07T20:33:07.9174786Z contiguous=True, 2025-05-07T20:33:07.9174877Z compiled=False, 2025-05-07T20:33:07.9174951Z ) 2025-05-07T20:33:07.9175186Z self = 2025-05-07T20:33:07.9175368Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.9175372Z 2025-05-07T20:33:07.9175461Z @given( 2025-05-07T20:33:07.9175582Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9175684Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9175807Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9175972Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9176148Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9176230Z ) 2025-05-07T20:33:07.9176491Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9176586Z def test_silu_mul_quant( 2025-05-07T20:33:07.9176667Z self, 2025-05-07T20:33:07.9176742Z T: int, 2025-05-07T20:33:07.9176865Z D: int, 2025-05-07T20:33:07.9176965Z scale_ub: Optional[float], 2025-05-07T20:33:07.9177055Z contiguous: bool, 2025-05-07T20:33:07.9177147Z compiled: bool, 2025-05-07T20:33:07.9177233Z ) -> None: 2025-05-07T20:33:07.9177333Z torch.manual_seed(2025) 2025-05-07T20:33:07.9177418Z 2025-05-07T20:33:07.9177595Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9179559Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9179571Z 2025-05-07T20:33:07.9179688Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9179692Z 2025-05-07T20:33:07.9179795Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9180034Z self=, 2025-05-07T20:33:07.9180112Z T=16384, 2025-05-07T20:33:07.9180200Z D=7168, 2025-05-07T20:33:07.9180282Z scale_ub=None, 2025-05-07T20:33:07.9180369Z contiguous=False, 2025-05-07T20:33:07.9180459Z compiled=True, 2025-05-07T20:33:07.9180534Z ) 2025-05-07T20:33:07.9180758Z self = 2025-05-07T20:33:07.9180949Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.9180954Z 2025-05-07T20:33:07.9181037Z @given( 2025-05-07T20:33:07.9181158Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9181265Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9181385Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9181511Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9181673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9181752Z ) 2025-05-07T20:33:07.9182017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9182108Z def test_silu_mul_quant( 2025-05-07T20:33:07.9182188Z self, 2025-05-07T20:33:07.9182273Z T: int, 2025-05-07T20:33:07.9182356Z D: int, 2025-05-07T20:33:07.9182457Z scale_ub: Optional[float], 2025-05-07T20:33:07.9182557Z contiguous: bool, 2025-05-07T20:33:07.9182646Z compiled: bool, 2025-05-07T20:33:07.9182726Z ) -> None: 2025-05-07T20:33:07.9183293Z torch.manual_seed(2025) 2025-05-07T20:33:07.9183393Z 2025-05-07T20:33:07.9183573Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9185707Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9185714Z 2025-05-07T20:33:07.9185837Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9185903Z 2025-05-07T20:33:07.9186012Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9186246Z self=, 2025-05-07T20:33:07.9186329Z T=4096, 2025-05-07T20:33:07.9186410Z D=7168, 2025-05-07T20:33:07.9186497Z scale_ub=None, 2025-05-07T20:33:07.9186655Z contiguous=True, 2025-05-07T20:33:07.9186741Z compiled=False, 2025-05-07T20:33:07.9186814Z ) 2025-05-07T20:33:07.9187049Z self = 2025-05-07T20:33:07.9187226Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.9187230Z 2025-05-07T20:33:07.9187310Z @given( 2025-05-07T20:33:07.9187426Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9187524Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9187651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9187770Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9187881Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9187957Z ) 2025-05-07T20:33:07.9188212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9188312Z def test_silu_mul_quant( 2025-05-07T20:33:07.9188402Z self, 2025-05-07T20:33:07.9188482Z T: int, 2025-05-07T20:33:07.9188566Z D: int, 2025-05-07T20:33:07.9188665Z scale_ub: Optional[float], 2025-05-07T20:33:07.9188757Z contiguous: bool, 2025-05-07T20:33:07.9188851Z compiled: bool, 2025-05-07T20:33:07.9188931Z ) -> None: 2025-05-07T20:33:07.9189027Z torch.manual_seed(2025) 2025-05-07T20:33:07.9189106Z 2025-05-07T20:33:07.9189279Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9191339Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9191350Z 2025-05-07T20:33:07.9191465Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9191538Z 2025-05-07T20:33:07.9191644Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9191908Z self=, 2025-05-07T20:33:07.9192014Z T=16384, 2025-05-07T20:33:07.9192122Z D=7168, 2025-05-07T20:33:07.9192403Z scale_ub=None, 2025-05-07T20:33:07.9192713Z contiguous=True, 2025-05-07T20:33:07.9193072Z compiled=False, 2025-05-07T20:33:07.9193277Z ) 2025-05-07T20:33:07.9193603Z self = 2025-05-07T20:33:07.9194131Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.9194427Z 2025-05-07T20:33:07.9194504Z @given( 2025-05-07T20:33:07.9194739Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9195068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9195379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9195724Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9196064Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9196374Z ) 2025-05-07T20:33:07.9196740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9197211Z def test_silu_mul_quant( 2025-05-07T20:33:07.9197618Z self, 2025-05-07T20:33:07.9197826Z T: int, 2025-05-07T20:33:07.9198061Z D: int, 2025-05-07T20:33:07.9198279Z scale_ub: Optional[float], 2025-05-07T20:33:07.9198555Z contiguous: bool, 2025-05-07T20:33:07.9198796Z compiled: bool, 2025-05-07T20:33:07.9199017Z ) -> None: 2025-05-07T20:33:07.9199236Z torch.manual_seed(2025) 2025-05-07T20:33:07.9199485Z 2025-05-07T20:33:07.9199757Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9202057Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9204138Z 2025-05-07T20:33:07.9204253Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9204471Z 2025-05-07T20:33:07.9204581Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9205016Z self=, 2025-05-07T20:33:07.9205438Z T=16384, 2025-05-07T20:33:07.9205637Z D=7168, 2025-05-07T20:33:07.9205828Z scale_ub=1200.0, 2025-05-07T20:33:07.9206047Z contiguous=True, 2025-05-07T20:33:07.9206273Z compiled=False, 2025-05-07T20:33:07.9206477Z ) 2025-05-07T20:33:07.9206808Z self = 2025-05-07T20:33:07.9207334Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.9207630Z 2025-05-07T20:33:07.9207713Z @given( 2025-05-07T20:33:07.9207938Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9208270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9208595Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9208948Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9209287Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9209591Z ) 2025-05-07T20:33:07.9209961Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9210425Z def test_silu_mul_quant( 2025-05-07T20:33:07.9210671Z self, 2025-05-07T20:33:07.9210862Z T: int, 2025-05-07T20:33:07.9211052Z D: int, 2025-05-07T20:33:07.9211317Z scale_ub: Optional[float], 2025-05-07T20:33:07.9211594Z contiguous: bool, 2025-05-07T20:33:07.9211831Z compiled: bool, 2025-05-07T20:33:07.9212051Z ) -> None: 2025-05-07T20:33:07.9212275Z torch.manual_seed(2025) 2025-05-07T20:33:07.9212521Z 2025-05-07T20:33:07.9212801Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9215045Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9217106Z 2025-05-07T20:33:07.9217225Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9217445Z 2025-05-07T20:33:07.9217556Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9217979Z self=, 2025-05-07T20:33:07.9218404Z T=128, 2025-05-07T20:33:07.9218635Z D=5120, 2025-05-07T20:33:07.9218823Z scale_ub=1200.0, 2025-05-07T20:33:07.9219091Z contiguous=False, 2025-05-07T20:33:07.9219318Z compiled=False, 2025-05-07T20:33:07.9219513Z ) 2025-05-07T20:33:07.9219847Z self = 2025-05-07T20:33:07.9220377Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.9220666Z 2025-05-07T20:33:07.9220792Z @given( 2025-05-07T20:33:07.9221015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9221334Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9221656Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9221990Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9222333Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9222628Z ) 2025-05-07T20:33:07.9222981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9223453Z def test_silu_mul_quant( 2025-05-07T20:33:07.9223699Z self, 2025-05-07T20:33:07.9223896Z T: int, 2025-05-07T20:33:07.9224085Z D: int, 2025-05-07T20:33:07.9224305Z scale_ub: Optional[float], 2025-05-07T20:33:07.9224585Z contiguous: bool, 2025-05-07T20:33:07.9224822Z compiled: bool, 2025-05-07T20:33:07.9225054Z ) -> None: 2025-05-07T20:33:07.9225277Z torch.manual_seed(2025) 2025-05-07T20:33:07.9225529Z 2025-05-07T20:33:07.9225810Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9226166Z 2025-05-07T20:33:07.9226356Z x_sign = torch.sign(x) 2025-05-07T20:33:07.9226654Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.9226976Z x = x_sign * x_clamp 2025-05-07T20:33:07.9227223Z x0 = x[:, :D] 2025-05-07T20:33:07.9227441Z x1 = x[:, D:] 2025-05-07T20:33:07.9227654Z 2025-05-07T20:33:07.9227835Z if contiguous: 2025-05-07T20:33:07.9228078Z x0 = x0.contiguous() 2025-05-07T20:33:07.9228354Z x1 = x1.contiguous() 2025-05-07T20:33:07.9228599Z 2025-05-07T20:33:07.9228794Z if scale_ub is not None: 2025-05-07T20:33:07.9229075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.9229421Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.9229735Z ) 2025-05-07T20:33:07.9230072Z else: 2025-05-07T20:33:07.9230287Z scale_ub_tensor = None 2025-05-07T20:33:07.9230545Z 2025-05-07T20:33:07.9230776Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.9231153Z op = silu_mul_quant 2025-05-07T20:33:07.9231403Z if compiled: 2025-05-07T20:33:07.9231654Z op = torch.compile(op) 2025-05-07T20:33:07.9231961Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.9232238Z 2025-05-07T20:33:07.9232429Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.9232601Z 2025-05-07T20:33:07.9232703Z moe/activation_test.py:117: 2025-05-07T20:33:07.9233006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.9233353Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.9233639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.9234380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.9235123Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.9235699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.9236438Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.9237143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.9237711Z kernel = self.compile( 2025-05-07T20:33:07.9238362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.9239109Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.9239517Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.9239771Z 2025-05-07T20:33:07.9239988Z self = 2025-05-07T20:33:07.9241210Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.9242720Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b79c5ca0>} 2025-05-07T20:33:07.9244190Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.9245295Z context = 2025-05-07T20:33:07.9245606Z 2025-05-07T20:33:07.9245777Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.9246330Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.9246821Z module_map=module_map) 2025-05-07T20:33:07.9247199Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.9247567Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.9247833Z E ^ 2025-05-07T20:33:07.9248324Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.9248820Z 2025-05-07T20:33:07.9249271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.9249830Z 2025-05-07T20:33:07.9249940Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9250372Z self=, 2025-05-07T20:33:07.9250790Z T=2048, 2025-05-07T20:33:07.9250978Z D=7168, 2025-05-07T20:33:07.9251168Z scale_ub=None, 2025-05-07T20:33:07.9251381Z contiguous=False, 2025-05-07T20:33:07.9251609Z compiled=False, 2025-05-07T20:33:07.9251810Z ) 2025-05-07T20:33:07.9252129Z self = 2025-05-07T20:33:07.9252712Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.9253047Z 2025-05-07T20:33:07.9253127Z @given( 2025-05-07T20:33:07.9253352Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9253676Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9253993Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9254333Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9254669Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9254960Z ) 2025-05-07T20:33:07.9255330Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9255790Z def test_silu_mul_quant( 2025-05-07T20:33:07.9256036Z self, 2025-05-07T20:33:07.9256231Z T: int, 2025-05-07T20:33:07.9256425Z D: int, 2025-05-07T20:33:07.9256641Z scale_ub: Optional[float], 2025-05-07T20:33:07.9256915Z contiguous: bool, 2025-05-07T20:33:07.9257150Z compiled: bool, 2025-05-07T20:33:07.9257371Z ) -> None: 2025-05-07T20:33:07.9257583Z torch.manual_seed(2025) 2025-05-07T20:33:07.9257819Z 2025-05-07T20:33:07.9258101Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9260394Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
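Note that across these reports the pool PyTorch holds keeps creeping up (21.69 GiB, then 21.73 GiB, then 21.74 GiB allocated) while free memory shrinks, which points at tensors from earlier failed examples being retained rather than at any single oversized allocation. A small diagnostic sketch, with log_cuda_mem as a hypothetical helper one could call between Hypothesis examples:

import torch

def log_cuda_mem(tag: str) -> None:
    # Print allocator counters to spot memory retained across examples.
    mib = 1024 ** 2
    allocated = torch.cuda.memory_allocated() / mib
    reserved = torch.cuda.memory_reserved() / mib
    print(f"{tag}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")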
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9262528Z 2025-05-07T20:33:07.9262650Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9262868Z 2025-05-07T20:33:07.9262976Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9263395Z self=, 2025-05-07T20:33:07.9263821Z T=128, 2025-05-07T20:33:07.9264009Z D=7168, 2025-05-07T20:33:07.9264192Z scale_ub=1200.0, 2025-05-07T20:33:07.9264423Z contiguous=True, 2025-05-07T20:33:07.9264655Z compiled=True, 2025-05-07T20:33:07.9264862Z ) 2025-05-07T20:33:07.9265201Z self = 2025-05-07T20:33:07.9265727Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.9266013Z 2025-05-07T20:33:07.9266100Z @given( 2025-05-07T20:33:07.9266322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9266651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9266967Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9267301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9267643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9267942Z ) 2025-05-07T20:33:07.9268306Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9268776Z def test_silu_mul_quant( 2025-05-07T20:33:07.9269019Z self, 2025-05-07T20:33:07.9269220Z T: int, 2025-05-07T20:33:07.9269410Z D: int, 2025-05-07T20:33:07.9269631Z scale_ub: Optional[float], 2025-05-07T20:33:07.9270005Z contiguous: bool, 2025-05-07T20:33:07.9270240Z compiled: bool, 2025-05-07T20:33:07.9270467Z ) -> None: 2025-05-07T20:33:07.9270690Z torch.manual_seed(2025) 2025-05-07T20:33:07.9270936Z 2025-05-07T20:33:07.9271224Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9277119Z 2025-05-07T20:33:07.9277340Z x_sign = torch.sign(x) 2025-05-07T20:33:07.9277659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.9278101Z x = x_sign * x_clamp 2025-05-07T20:33:07.9278370Z x0 = x[:, :D] 2025-05-07T20:33:07.9278594Z x1 = x[:, D:] 2025-05-07T20:33:07.9278823Z 2025-05-07T20:33:07.9279032Z if contiguous: 2025-05-07T20:33:07.9279275Z x0 = x0.contiguous() 2025-05-07T20:33:07.9279558Z x1 = x1.contiguous() 2025-05-07T20:33:07.9279820Z 2025-05-07T20:33:07.9280027Z if scale_ub is not None: 2025-05-07T20:33:07.9280327Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.9280691Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.9281016Z ) 2025-05-07T20:33:07.9281224Z else: 2025-05-07T20:33:07.9281448Z scale_ub_tensor = None 2025-05-07T20:33:07.9281710Z 2025-05-07T20:33:07.9281958Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.9282301Z op = silu_mul_quant 2025-05-07T20:33:07.9282573Z if compiled: 2025-05-07T20:33:07.9283318Z op = torch.compile(op) 2025-05-07T20:33:07.9283672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.9283970Z 2025-05-07T20:33:07.9284164Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.9284346Z 2025-05-07T20:33:07.9284450Z moe/activation_test.py:117: 2025-05-07T20:33:07.9284918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.9285273Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.9285637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.9286244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.9286857Z return fn(*args, **kwargs) 2025-05-07T20:33:07.9287569Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.9288393Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.9288977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.9289712Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.9290428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.9291000Z kernel = self.compile( 2025-05-07T20:33:07.9291578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.9292280Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.9292701Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.9292979Z 2025-05-07T20:33:07.9293219Z self = 2025-05-07T20:33:07.9294406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.9295940Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b793e0d0>} 2025-05-07T20:33:07.9297416Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.9298537Z context = 2025-05-07T20:33:07.9298857Z 2025-05-07T20:33:07.9299027Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.9299581Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.9300070Z module_map=module_map) 2025-05-07T20:33:07.9300521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.9300898Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.9301170Z E ^ 2025-05-07T20:33:07.9301671Z E ValueError("type fp8e4nv not supported in this architecture. 
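This occurrence is reached through the torch.compile path (the eval_frame.py frame above), but Dynamo simply re-enters the same Triton JIT, so the failure is identical to the eager case. If running on pre-sm_89 GPUs were a goal, the fp8 flavor would have to be chosen host-side before the kernel launch; a speculative sketch (pick_fp8_dtype is a hypothetical helper, and whether FBGEMM's kernels accept E5M2 here is an assumption, not something this log confirms):

import torch

def pick_fp8_dtype() -> torch.dtype:
    # Triton reports only 'fp8e4b15' and 'fp8e5' on this architecture,
    # so fall back to E5M2 when E4M3 (fp8e4nv) is unavailable.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2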
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.9302177Z 2025-05-07T20:33:07.9302636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.9303244Z 2025-05-07T20:33:07.9303358Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9303789Z self=, 2025-05-07T20:33:07.9304205Z T=128, 2025-05-07T20:33:07.9304394Z D=7168, 2025-05-07T20:33:07.9304591Z scale_ub=1200.0, 2025-05-07T20:33:07.9304807Z contiguous=True, 2025-05-07T20:33:07.9305030Z compiled=False, 2025-05-07T20:33:07.9305237Z ) 2025-05-07T20:33:07.9305559Z self = 2025-05-07T20:33:07.9306076Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.9306361Z 2025-05-07T20:33:07.9306443Z @given( 2025-05-07T20:33:07.9306667Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9307042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9307360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9307732Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9308071Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9308367Z ) 2025-05-07T20:33:07.9308738Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9309197Z def test_silu_mul_quant( 2025-05-07T20:33:07.9309486Z self, 2025-05-07T20:33:07.9309683Z T: int, 2025-05-07T20:33:07.9310020Z D: int, 2025-05-07T20:33:07.9310241Z scale_ub: Optional[float], 2025-05-07T20:33:07.9310521Z contiguous: bool, 2025-05-07T20:33:07.9310762Z compiled: bool, 2025-05-07T20:33:07.9310991Z ) -> None: 2025-05-07T20:33:07.9311213Z torch.manual_seed(2025) 2025-05-07T20:33:07.9311453Z 2025-05-07T20:33:07.9311732Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9312096Z 2025-05-07T20:33:07.9312286Z x_sign = torch.sign(x) 2025-05-07T20:33:07.9312594Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.9314859Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9316917Z 2025-05-07T20:33:07.9317033Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.9317254Z 2025-05-07T20:33:07.9317365Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9317790Z self=, 2025-05-07T20:33:07.9318222Z T=128, 2025-05-07T20:33:07.9318418Z D=5120, 2025-05-07T20:33:07.9318610Z scale_ub=1200.0, 2025-05-07T20:33:07.9318844Z contiguous=True, 2025-05-07T20:33:07.9319073Z compiled=True, 2025-05-07T20:33:07.9319274Z ) 2025-05-07T20:33:07.9319611Z self = 2025-05-07T20:33:07.9320135Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.9320423Z 2025-05-07T20:33:07.9320505Z @given( 2025-05-07T20:33:07.9320731Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9321143Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9321460Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9321792Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9322130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9322425Z ) 2025-05-07T20:33:07.9322784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9323260Z def test_silu_mul_quant( 2025-05-07T20:33:07.9323505Z self, 2025-05-07T20:33:07.9323698Z T: int, 2025-05-07T20:33:07.9323891Z D: int, 2025-05-07T20:33:07.9324111Z scale_ub: Optional[float], 2025-05-07T20:33:07.9324390Z contiguous: bool, 2025-05-07T20:33:07.9324633Z compiled: bool, 2025-05-07T20:33:07.9324866Z ) -> None: 2025-05-07T20:33:07.9325088Z torch.manual_seed(2025) 2025-05-07T20:33:07.9325333Z 2025-05-07T20:33:07.9325619Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9325984Z 2025-05-07T20:33:07.9326178Z x_sign = torch.sign(x) 2025-05-07T20:33:07.9326481Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.9328717Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9330849Z 2025-05-07T20:33:07.9330968Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.9331189Z 2025-05-07T20:33:07.9331300Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9331725Z self=, 2025-05-07T20:33:07.9332146Z T=128, 2025-05-07T20:33:07.9332337Z D=7168, 2025-05-07T20:33:07.9332524Z scale_ub=None, 2025-05-07T20:33:07.9332739Z contiguous=True, 2025-05-07T20:33:07.9332961Z compiled=True, 2025-05-07T20:33:07.9333154Z ) 2025-05-07T20:33:07.9333485Z self = 2025-05-07T20:33:07.9334004Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.9334288Z 2025-05-07T20:33:07.9334373Z @given( 2025-05-07T20:33:07.9334601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9334927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9335246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9335580Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9335923Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9336220Z ) 2025-05-07T20:33:07.9336583Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9337055Z def test_silu_mul_quant( 2025-05-07T20:33:07.9337301Z self, 2025-05-07T20:33:07.9337486Z T: int, 2025-05-07T20:33:07.9337690Z D: int, 2025-05-07T20:33:07.9337917Z scale_ub: Optional[float], 2025-05-07T20:33:07.9338196Z contiguous: bool, 2025-05-07T20:33:07.9338441Z compiled: bool, 2025-05-07T20:33:07.9338669Z ) -> None: 2025-05-07T20:33:07.9338883Z torch.manual_seed(2025) 2025-05-07T20:33:07.9339133Z 2025-05-07T20:33:07.9339414Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9341712Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9343806Z 2025-05-07T20:33:07.9343930Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9344281Z =============================== warnings summary =============================== 2025-05-07T20:33:07.9344855Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:07.9345597Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:07.9346341Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:07.9347726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:07.9349028Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:07.9349439Z 2025-05-07T20:33:07.9349661Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:07.9350342Z ================= 1 failed, 1 deselected, 3 warnings in 19.35s ================= 2025-05-07T20:33:09.4854629Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:09.5494604Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:33:09.5494867Z 2025-05-07T20:33:11.5513144Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:13.7047227Z ============================= test session starts ============================== 2025-05-07T20:33:13.7047950Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:13.7048503Z cachedir: .pytest_cache 2025-05-07T20:33:13.7049121Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:13.7049897Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:13.7050322Z plugins: hypothesis-6.131.14 2025-05-07T20:33:15.3321410Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:15.5451918Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:15.5452339Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:15.5452581Z 2025-05-07T20:33:18.2535020Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.2537011Z self=, 2025-05-07T20:33:18.2537958Z T=1, 2025-05-07T20:33:18.2538344Z D=5120, 2025-05-07T20:33:18.2538752Z scale_ub=None, 2025-05-07T20:33:18.2539202Z contiguous=True, 2025-05-07T20:33:18.2539653Z compiled=True, 2025-05-07T20:33:18.2540060Z ) 2025-05-07T20:33:18.2540725Z self = 2025-05-07T20:33:18.2541776Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:18.2542339Z 2025-05-07T20:33:18.2542507Z @given( 2025-05-07T20:33:18.2542979Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.2543640Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.2544719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.2545274Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.2545665Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.2545973Z ) 2025-05-07T20:33:18.2546343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.2546835Z def test_silu_mul_quant( 2025-05-07T20:33:18.2547095Z self, 2025-05-07T20:33:18.2547301Z T: int, 2025-05-07T20:33:18.2547503Z D: int, 2025-05-07T20:33:18.2547730Z scale_ub: Optional[float], 2025-05-07T20:33:18.2548017Z contiguous: bool, 2025-05-07T20:33:18.2548263Z compiled: bool, 2025-05-07T20:33:18.2548502Z ) -> None: 2025-05-07T20:33:18.2548725Z torch.manual_seed(2025) 2025-05-07T20:33:18.2548978Z 2025-05-07T20:33:18.2549263Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.2549660Z 2025-05-07T20:33:18.2550073Z x_sign = torch.sign(x) 2025-05-07T20:33:18.2550374Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:18.2550707Z x = x_sign * x_clamp 2025-05-07T20:33:18.2550957Z x0 = x[:, :D] 2025-05-07T20:33:18.2551176Z x1 = x[:, D:] 2025-05-07T20:33:18.2551388Z 2025-05-07T20:33:18.2551581Z if contiguous: 2025-05-07T20:33:18.2551922Z x0 = x0.contiguous() 2025-05-07T20:33:18.2552204Z x1 = x1.contiguous() 2025-05-07T20:33:18.2552552Z 2025-05-07T20:33:18.2552749Z if scale_ub is not None: 2025-05-07T20:33:18.2553045Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.2553405Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.2553741Z ) 2025-05-07T20:33:18.2553932Z else: 2025-05-07T20:33:18.2554229Z scale_ub_tensor = None 2025-05-07T20:33:18.2554500Z 2025-05-07T20:33:18.2554732Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.2555070Z op = silu_mul_quant 2025-05-07T20:33:18.2555333Z if compiled: 2025-05-07T20:33:18.2555589Z op = torch.compile(op) 2025-05-07T20:33:18.2555903Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.2556198Z 2025-05-07T20:33:18.2556392Z y_fp8, y_scale = fn() 2025-05-07T20:33:18.2556694Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:18.2557005Z 2025-05-07T20:33:18.2557244Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.2557602Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:18.2557910Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:18.2558238Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:18.2558621Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:18.2558961Z 2025-05-07T20:33:18.2559167Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:18.2559374Z 2025-05-07T20:33:18.2559476Z moe/activation_test.py:126: 2025-05-07T20:33:18.2559795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.2560157Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:18.2560494Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:18.2561355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:18.2562186Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:18.2562777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.2563512Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.2564260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:18.2565099Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:18.2565962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:18.2566769Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:18.2567555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:18.2568241Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:18.2568872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:18.2569424Z fn() 2025-05-07T20:33:18.2569960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:18.2570577Z self.fn.run( 
2025-05-07T20:33:18.2571062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.2571629Z kernel = self.compile( 2025-05-07T20:33:18.2572201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.2572892Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.2573365Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.2573653Z 2025-05-07T20:33:18.2573866Z self = 2025-05-07T20:33:18.2575047Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.2576621Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f59015d99d0>} 2025-05-07T20:33:18.2578089Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.2579201Z context = 2025-05-07T20:33:18.2579505Z 2025-05-07T20:33:18.2579683Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.2580236Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.2580725Z module_map=module_map) 2025-05-07T20:33:18.2581105Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.2581472Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:18.2581739Z E ^ 2025-05-07T20:33:18.2582232Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.2582722Z 2025-05-07T20:33:18.2583599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.2584156Z 2025-05-07T20:33:18.2584264Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.2584693Z self=, 2025-05-07T20:33:18.2585114Z T=2048, 2025-05-07T20:33:18.2585306Z D=5120, 2025-05-07T20:33:18.2585490Z scale_ub=1200.0, 2025-05-07T20:33:18.2585710Z contiguous=True, 2025-05-07T20:33:18.2585934Z compiled=False, 2025-05-07T20:33:18.2586131Z ) 2025-05-07T20:33:19.7634156Z self = 2025-05-07T20:33:19.7634971Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.7635369Z 2025-05-07T20:33:19.7635460Z @given( 2025-05-07T20:33:19.7635700Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.7636320Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.7636651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.7637005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.7637348Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.7637654Z ) 2025-05-07T20:33:19.7638036Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.7638519Z def test_silu_mul_quant( 2025-05-07T20:33:19.7638775Z self, 2025-05-07T20:33:19.7638976Z T: int, 2025-05-07T20:33:19.7639176Z D: int, 2025-05-07T20:33:19.7639406Z scale_ub: Optional[float], 2025-05-07T20:33:19.7639693Z contiguous: bool, 2025-05-07T20:33:19.7639937Z compiled: bool, 2025-05-07T20:33:19.7640184Z ) -> None: 2025-05-07T20:33:19.7640405Z torch.manual_seed(2025) 2025-05-07T20:33:19.7640656Z 2025-05-07T20:33:19.7640939Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.7641307Z 
2025-05-07T20:33:19.7641503Z x_sign = torch.sign(x) 2025-05-07T20:33:19.7641794Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.7642122Z x = x_sign * x_clamp 2025-05-07T20:33:19.7642370Z x0 = x[:, :D] 2025-05-07T20:33:19.7642585Z x1 = x[:, D:] 2025-05-07T20:33:19.7642920Z 2025-05-07T20:33:19.7643118Z if contiguous: 2025-05-07T20:33:19.7643439Z x0 = x0.contiguous() 2025-05-07T20:33:19.7643714Z x1 = x1.contiguous() 2025-05-07T20:33:19.7643975Z 2025-05-07T20:33:19.7644168Z if scale_ub is not None: 2025-05-07T20:33:19.7644455Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.7644812Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.7645218Z ) 2025-05-07T20:33:19.7645420Z else: 2025-05-07T20:33:19.7645641Z scale_ub_tensor = None 2025-05-07T20:33:19.7645906Z 2025-05-07T20:33:19.7646152Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.7646491Z op = silu_mul_quant 2025-05-07T20:33:19.7646758Z if compiled: 2025-05-07T20:33:19.7647015Z op = torch.compile(op) 2025-05-07T20:33:19.7647330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.7647626Z 2025-05-07T20:33:19.7647818Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.7648001Z 2025-05-07T20:33:19.7648103Z moe/activation_test.py:117: 2025-05-07T20:33:19.7648414Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.7648758Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.7649047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.7649797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.7650555Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.7651127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.7651867Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.7652583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.7653151Z kernel = self.compile( 2025-05-07T20:33:19.7653730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.7654438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.7654857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.7655099Z 2025-05-07T20:33:19.7655318Z self = 2025-05-07T20:33:19.7656545Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.7658071Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58dd9ad5e0>} 2025-05-07T20:33:19.7659542Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.7660649Z context = 2025-05-07T20:33:19.7660963Z 2025-05-07T20:33:19.7667450Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.7668069Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.7668594Z module_map=module_map) 2025-05-07T20:33:19.7668997Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.7669371Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.7669650Z E ^ 2025-05-07T20:33:19.7670312Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.7670879Z 2025-05-07T20:33:19.7671349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.7671956Z 2025-05-07T20:33:19.7672060Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.7672495Z self=, 2025-05-07T20:33:19.7672923Z T=2048, 2025-05-07T20:33:19.7673110Z D=5120, 2025-05-07T20:33:19.7673357Z scale_ub=1200.0, 2025-05-07T20:33:19.7673592Z contiguous=True, 2025-05-07T20:33:19.7673813Z compiled=True, 2025-05-07T20:33:19.7674027Z ) 2025-05-07T20:33:19.7674368Z self = 2025-05-07T20:33:19.7674898Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.7675188Z 2025-05-07T20:33:19.7675269Z @given( 2025-05-07T20:33:19.7675507Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.7675842Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.7676156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.7676504Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.7676852Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.7677153Z ) 2025-05-07T20:33:19.7677523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.7677996Z def test_silu_mul_quant( 2025-05-07T20:33:19.7678252Z self, 2025-05-07T20:33:19.7678445Z T: int, 2025-05-07T20:33:19.7678650Z D: int, 2025-05-07T20:33:19.7678875Z scale_ub: Optional[float], 2025-05-07T20:33:19.7679151Z contiguous: bool, 2025-05-07T20:33:19.7679401Z compiled: bool, 2025-05-07T20:33:19.7679631Z ) -> None: 2025-05-07T20:33:19.7679849Z torch.manual_seed(2025) 2025-05-07T20:33:19.7680099Z 2025-05-07T20:33:19.7680380Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.7680732Z 2025-05-07T20:33:19.7680928Z x_sign = torch.sign(x) 2025-05-07T20:33:19.7681227Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.7681552Z x = x_sign * x_clamp 2025-05-07T20:33:19.7681793Z x0 = x[:, :D] 2025-05-07T20:33:19.7682016Z x1 = x[:, D:] 2025-05-07T20:33:19.7682231Z 2025-05-07T20:33:19.7682416Z if contiguous: 2025-05-07T20:33:19.7682653Z x0 = x0.contiguous() 2025-05-07T20:33:19.7683197Z x1 = x1.contiguous() 2025-05-07T20:33:19.7683440Z 2025-05-07T20:33:19.7683632Z if scale_ub is not None: 2025-05-07T20:33:19.7684001Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.7684352Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.7684670Z ) 2025-05-07T20:33:19.7684860Z else: 2025-05-07T20:33:19.7685071Z scale_ub_tensor = None 2025-05-07T20:33:19.7685320Z 2025-05-07T20:33:19.7685559Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.7685893Z op = silu_mul_quant 2025-05-07T20:33:19.7686143Z if compiled: 
2025-05-07T20:33:19.7686396Z op = torch.compile(op) 2025-05-07T20:33:19.7686703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.7686982Z 2025-05-07T20:33:19.7687175Z y_fp8, y_scale = fn() 2025-05-07T20:33:19.7687466Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.7687765Z 2025-05-07T20:33:19.7688004Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.7688356Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.7688658Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.7688978Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.7689356Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.7689684Z 2025-05-07T20:33:19.7689957Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.7690172Z 2025-05-07T20:33:19.7690332Z moe/activation_test.py:126: 2025-05-07T20:33:19.7690645Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.7690991Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.7691331Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.7692180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.7693068Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.7693646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.7694384Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.7695127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.7695909Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.7696715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:19.7697521Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.7698311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.7698994Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.7699639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.7700197Z fn() 2025-05-07T20:33:19.7700735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.7701350Z self.fn.run( 2025-05-07T20:33:19.7701841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.7702410Z kernel = self.compile( 2025-05-07T20:33:19.7702977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.7703674Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.7704085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.7704328Z 2025-05-07T20:33:19.7704544Z self = 2025-05-07T20:33:19.7705756Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:33:19.7707266Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f59000565e0>} 2025-05-07T20:33:19.7708736Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.7709923Z context = 2025-05-07T20:33:19.7710231Z 2025-05-07T20:33:19.7710407Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.7710953Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.7711447Z module_map=module_map) 2025-05-07T20:33:19.7711822Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.7712178Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.7712447Z E ^ 2025-05-07T20:33:19.7712983Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.7713543Z 2025-05-07T20:33:19.7713999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.7714553Z 2025-05-07T20:33:19.7714650Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.7715079Z self=, 2025-05-07T20:33:19.7715544Z T=16384, 2025-05-07T20:33:19.7715730Z D=7168, 2025-05-07T20:33:19.7715919Z scale_ub=1200.0, 2025-05-07T20:33:19.7716142Z contiguous=False, 2025-05-07T20:33:19.7716364Z compiled=False, 2025-05-07T20:33:19.7716566Z ) 2025-05-07T20:33:21.1012400Z self = 2025-05-07T20:33:21.1013989Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:21.1014835Z 2025-05-07T20:33:21.1015009Z @given( 2025-05-07T20:33:21.1015516Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:21.1016057Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:21.1016429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:21.1016781Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:21.1017136Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:21.1017435Z ) 2025-05-07T20:33:21.1017814Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:21.1018290Z def test_silu_mul_quant( 2025-05-07T20:33:21.1018541Z self, 2025-05-07T20:33:21.1018748Z T: int, 2025-05-07T20:33:21.1018961Z D: int, 2025-05-07T20:33:21.1019183Z scale_ub: Optional[float], 2025-05-07T20:33:21.1019474Z contiguous: bool, 2025-05-07T20:33:21.1019726Z compiled: bool, 2025-05-07T20:33:21.1019961Z ) -> None: 2025-05-07T20:33:21.1020188Z torch.manual_seed(2025) 2025-05-07T20:33:21.1020448Z 2025-05-07T20:33:21.1020726Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:21.1021102Z 2025-05-07T20:33:21.1021306Z x_sign = torch.sign(x) 2025-05-07T20:33:21.1021613Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:21.1021938Z x = x_sign * x_clamp 2025-05-07T20:33:21.1022195Z x0 = x[:, :D] 2025-05-07T20:33:21.1022425Z x1 = x[:, D:] 2025-05-07T20:33:21.1022641Z 2025-05-07T20:33:21.1022837Z if contiguous: 2025-05-07T20:33:21.1023080Z x0 = x0.contiguous() 2025-05-07T20:33:21.1023347Z x1 = x1.contiguous() 2025-05-07T20:33:21.1023912Z 2025-05-07T20:33:21.1024116Z if scale_ub is not None: 2025-05-07T20:33:21.1024400Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:21.1024756Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:21.1025091Z ) 2025-05-07T20:33:21.1025289Z else: 2025-05-07T20:33:21.1025511Z scale_ub_tensor = None 2025-05-07T20:33:21.1025782Z 2025-05-07T20:33:21.1026016Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:33:21.1026349Z op = silu_mul_quant 2025-05-07T20:33:21.1026609Z if compiled: 2025-05-07T20:33:21.1026861Z op = torch.compile(op) 2025-05-07T20:33:21.1027175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.1027470Z 2025-05-07T20:33:21.1027670Z > y_fp8, y_scale = fn() 2025-05-07T20:33:21.1027842Z 2025-05-07T20:33:21.1027948Z moe/activation_test.py:117: 2025-05-07T20:33:21.1028264Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.1028615Z moe/activation_test.py:115: in fn 2025-05-07T20:33:21.1028900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.1029645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:21.1030658Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:21.1031325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:21.1032057Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:21.1032775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:21.1033426Z kernel = self.compile( 2025-05-07T20:33:21.1033999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:21.1034703Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:21.1035120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.1035365Z 2025-05-07T20:33:21.1035587Z self = 2025-05-07T20:33:21.1036761Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:21.1038290Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fffea280>} 2025-05-07T20:33:21.1039769Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:21.1040877Z context = 2025-05-07T20:33:21.1041180Z 2025-05-07T20:33:21.1041356Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:21.1041901Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:21.1042407Z module_map=module_map) 2025-05-07T20:33:21.1042785Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:21.1043142Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:21.1043411Z E ^ 2025-05-07T20:33:21.1043915Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:21.1044407Z 2025-05-07T20:33:21.1044861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:21.1045420Z 2025-05-07T20:33:21.1045568Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:21.1045999Z self=, 2025-05-07T20:33:21.1046425Z T=1, 2025-05-07T20:33:21.1046604Z D=7168, 2025-05-07T20:33:21.1046804Z scale_ub=None, 2025-05-07T20:33:21.1047020Z contiguous=True, 2025-05-07T20:33:21.1047237Z compiled=True, 2025-05-07T20:33:21.1047448Z ) 2025-05-07T20:33:21.1047778Z self = 2025-05-07T20:33:21.1048289Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:21.1048559Z 2025-05-07T20:33:21.1048632Z @given( 2025-05-07T20:33:21.1048863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:21.1049188Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:21.1049500Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:21.1049837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:21.1050179Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:21.1050466Z ) 2025-05-07T20:33:21.1050822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:21.1051286Z def test_silu_mul_quant( 2025-05-07T20:33:21.1051527Z self, 2025-05-07T20:33:21.1051719Z T: int, 2025-05-07T20:33:21.1051969Z D: int, 2025-05-07T20:33:21.1052188Z scale_ub: Optional[float], 2025-05-07T20:33:21.1052496Z contiguous: bool, 2025-05-07T20:33:21.1052739Z compiled: bool, 2025-05-07T20:33:21.1052960Z ) -> None: 2025-05-07T20:33:21.1053171Z torch.manual_seed(2025) 2025-05-07T20:33:21.1053421Z 2025-05-07T20:33:21.1053695Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:21.1054086Z 2025-05-07T20:33:21.1054280Z x_sign = torch.sign(x) 2025-05-07T20:33:21.1054574Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:21.1054890Z x = x_sign * x_clamp 2025-05-07T20:33:21.1055134Z x0 = x[:, :D] 2025-05-07T20:33:21.1055353Z x1 = x[:, D:] 2025-05-07T20:33:21.1055557Z 2025-05-07T20:33:21.1055750Z if contiguous: 2025-05-07T20:33:21.1055989Z x0 = x0.contiguous() 2025-05-07T20:33:21.1056245Z x1 = x1.contiguous() 2025-05-07T20:33:21.1056494Z 2025-05-07T20:33:21.1056687Z if scale_ub is not None: 2025-05-07T20:33:21.1056960Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:21.1057308Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:21.1057627Z ) 2025-05-07T20:33:21.1057819Z else: 2025-05-07T20:33:21.1058022Z scale_ub_tensor = None 2025-05-07T20:33:21.1058282Z 2025-05-07T20:33:21.1058522Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:21.1058838Z op = silu_mul_quant 2025-05-07T20:33:21.1059092Z if compiled: 2025-05-07T20:33:21.1059342Z op = torch.compile(op) 2025-05-07T20:33:21.1059643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.1059932Z 2025-05-07T20:33:21.1060129Z y_fp8, y_scale = fn() 2025-05-07T20:33:21.1060414Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:21.1060718Z 2025-05-07T20:33:21.1060962Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:21.1061300Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:21.1061607Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:21.1061930Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:21.1062302Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:21.1062619Z 2025-05-07T20:33:21.1062819Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:21.1063021Z 2025-05-07T20:33:21.1063124Z moe/activation_test.py:126: 2025-05-07T20:33:21.1063423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.1063826Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:21.1064161Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:21.1065000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:21.1065821Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:21.1066450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:21.1067188Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:21.1067920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:21.1068696Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:21.1069511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:21.1070415Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:21.1071201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:21.1071944Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:21.1072592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:21.1073194Z fn() 2025-05-07T20:33:21.1073725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:21.1074357Z self.fn.run( 2025-05-07T20:33:21.1074856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:21.1075463Z kernel = self.compile( 2025-05-07T20:33:21.1076044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:21.1076747Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:21.1077163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.1077405Z 2025-05-07T20:33:21.1077619Z self = 2025-05-07T20:33:21.1078799Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:21.1080313Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58fffff940>} 2025-05-07T20:33:21.1081796Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:21.1083198Z context = 2025-05-07T20:33:21.1083512Z 2025-05-07T20:33:21.1083685Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:21.1084238Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:21.1084735Z module_map=module_map) 2025-05-07T20:33:21.1085112Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:21.1085476Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:21.1085758Z E ^ 2025-05-07T20:33:21.1086244Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:21.1086741Z 2025-05-07T20:33:21.1087267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:21.1087828Z 2025-05-07T20:33:21.1087929Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:21.1088364Z self=, 2025-05-07T20:33:21.1088780Z T=4096, 2025-05-07T20:33:21.1088968Z D=5120, 2025-05-07T20:33:21.1089160Z scale_ub=None, 2025-05-07T20:33:21.1089367Z contiguous=False, 2025-05-07T20:33:21.1089599Z compiled=False, 2025-05-07T20:33:21.1089802Z ) 2025-05-07T20:33:22.8425841Z self = 2025-05-07T20:33:22.8426937Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.8427332Z 2025-05-07T20:33:22.8427419Z @given( 2025-05-07T20:33:22.8427686Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.8428018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.8428342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.8428699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.8429043Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.8429355Z ) 2025-05-07T20:33:22.8429732Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.8430673Z def test_silu_mul_quant( 2025-05-07T20:33:22.8430930Z self, 2025-05-07T20:33:22.8431216Z T: int, 2025-05-07T20:33:22.8431418Z D: int, 2025-05-07T20:33:22.8431635Z scale_ub: Optional[float], 2025-05-07T20:33:22.8431912Z contiguous: bool, 2025-05-07T20:33:22.8432161Z compiled: bool, 2025-05-07T20:33:22.8432385Z ) -> None: 2025-05-07T20:33:22.8432600Z torch.manual_seed(2025) 2025-05-07T20:33:22.8432933Z 2025-05-07T20:33:22.8433204Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.8433561Z 2025-05-07T20:33:22.8433755Z x_sign = torch.sign(x) 2025-05-07T20:33:22.8434046Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.8434364Z x = x_sign * x_clamp 2025-05-07T20:33:22.8434607Z x0 = x[:, :D] 2025-05-07T20:33:22.8434816Z x1 = x[:, D:] 2025-05-07T20:33:22.8435025Z 2025-05-07T20:33:22.8435211Z if contiguous: 2025-05-07T20:33:22.8435447Z x0 = x0.contiguous() 2025-05-07T20:33:22.8435705Z x1 = x1.contiguous() 2025-05-07T20:33:22.8435951Z 2025-05-07T20:33:22.8436144Z if scale_ub is not None: 2025-05-07T20:33:22.8436415Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.8436758Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.8437078Z ) 2025-05-07T20:33:22.8437266Z else: 2025-05-07T20:33:22.8437484Z scale_ub_tensor = None 2025-05-07T20:33:22.8437741Z 2025-05-07T20:33:22.8437966Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.8438292Z op = silu_mul_quant 2025-05-07T20:33:22.8438549Z if compiled: 
2025-05-07T20:33:22.8438795Z op = torch.compile(op) 2025-05-07T20:33:22.8439100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.8439387Z 2025-05-07T20:33:22.8439570Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.8439744Z 2025-05-07T20:33:22.8439845Z moe/activation_test.py:117: 2025-05-07T20:33:22.8440152Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8440502Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.8440782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.8441528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.8442286Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.8442848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.8443674Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.8444383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.8444948Z kernel = self.compile( 2025-05-07T20:33:22.8445513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.8446215Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.8446627Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8446867Z 2025-05-07T20:33:22.8447085Z self = 2025-05-07T20:33:22.8448245Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.8449766Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58ffeb8430>} 2025-05-07T20:33:22.8451272Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.8452410Z context = 2025-05-07T20:33:22.8452713Z 2025-05-07T20:33:22.8452879Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.8453425Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.8453960Z module_map=module_map) 2025-05-07T20:33:22.8454337Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.8454694Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.8454957Z E ^ 2025-05-07T20:33:22.8461235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.8461778Z 2025-05-07T20:33:22.8462255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.8462835Z 2025-05-07T20:33:22.8462952Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.8463399Z self=, 2025-05-07T20:33:22.8463842Z T=4096, 2025-05-07T20:33:22.8464041Z D=7168, 2025-05-07T20:33:22.8464253Z scale_ub=None, 2025-05-07T20:33:22.8464489Z contiguous=False, 2025-05-07T20:33:22.8464735Z compiled=False, 2025-05-07T20:33:22.8464958Z ) 2025-05-07T20:33:22.8465299Z self = 2025-05-07T20:33:22.8465828Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.8466129Z 2025-05-07T20:33:22.8466211Z @given( 2025-05-07T20:33:22.8466455Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.8466781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.8467107Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.8467461Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.8467813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.8468113Z ) 2025-05-07T20:33:22.8468487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.8468966Z def test_silu_mul_quant( 2025-05-07T20:33:22.8469216Z self, 2025-05-07T20:33:22.8469423Z T: int, 2025-05-07T20:33:22.8469638Z D: int, 2025-05-07T20:33:22.8470033Z scale_ub: Optional[float], 2025-05-07T20:33:22.8470322Z contiguous: bool, 2025-05-07T20:33:22.8470579Z compiled: bool, 2025-05-07T20:33:22.8470887Z ) -> None: 2025-05-07T20:33:22.8471123Z torch.manual_seed(2025) 2025-05-07T20:33:22.8471381Z 2025-05-07T20:33:22.8471661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.8472030Z 2025-05-07T20:33:22.8472238Z x_sign = torch.sign(x) 2025-05-07T20:33:22.8472543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.8472878Z x = x_sign * x_clamp 2025-05-07T20:33:22.8473129Z x0 = x[:, :D] 2025-05-07T20:33:22.8473362Z x1 = x[:, D:] 2025-05-07T20:33:22.8473570Z 2025-05-07T20:33:22.8473766Z if contiguous: 2025-05-07T20:33:22.8474004Z x0 = x0.contiguous() 2025-05-07T20:33:22.8474273Z x1 = x1.contiguous() 2025-05-07T20:33:22.8474520Z 2025-05-07T20:33:22.8474718Z if scale_ub is not None: 2025-05-07T20:33:22.8475001Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.8475345Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.8475671Z ) 2025-05-07T20:33:22.8475865Z else: 2025-05-07T20:33:22.8476068Z scale_ub_tensor = None 2025-05-07T20:33:22.8476349Z 2025-05-07T20:33:22.8476616Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.8476938Z op = silu_mul_quant 2025-05-07T20:33:22.8477249Z if compiled: 2025-05-07T20:33:22.8477505Z op = torch.compile(op) 2025-05-07T20:33:22.8477850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.8478141Z 2025-05-07T20:33:22.8478336Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.8478504Z 2025-05-07T20:33:22.8478609Z moe/activation_test.py:117: 2025-05-07T20:33:22.8478908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8479307Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.8479598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.8480341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.8481090Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.8481671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.8482410Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.8483578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.8484151Z kernel = self.compile( 2025-05-07T20:33:22.8484724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.8485420Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.8485840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8486090Z 2025-05-07T20:33:22.8486304Z self = 2025-05-07T20:33:22.8487476Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.8488981Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58ffbdedc0>} 2025-05-07T20:33:22.8490457Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.8491570Z context = 2025-05-07T20:33:22.8491874Z 2025-05-07T20:33:22.8492052Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.8492694Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.8493187Z module_map=module_map) 2025-05-07T20:33:22.8493568Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.8493934Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.8494201Z E ^ 2025-05-07T20:33:22.8494697Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.8495189Z 2025-05-07T20:33:22.8495644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.8496200Z 2025-05-07T20:33:22.8496310Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.8496737Z self=, 2025-05-07T20:33:22.8497167Z T=128, 2025-05-07T20:33:22.8497355Z D=7168, 2025-05-07T20:33:22.8497548Z scale_ub=None, 2025-05-07T20:33:22.8497771Z contiguous=False, 2025-05-07T20:33:22.8498002Z compiled=True, 2025-05-07T20:33:22.8498199Z ) 2025-05-07T20:33:22.9256776Z self = 2025-05-07T20:33:22.9257694Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.9258063Z 2025-05-07T20:33:22.9258144Z @given( 2025-05-07T20:33:22.9258473Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.9258794Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.9259111Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.9259446Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.9259782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.9260154Z ) 2025-05-07T20:33:22.9260511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.9260978Z def test_silu_mul_quant( 2025-05-07T20:33:22.9261222Z self, 2025-05-07T20:33:22.9261408Z T: int, 2025-05-07T20:33:22.9261604Z D: int, 2025-05-07T20:33:22.9261819Z scale_ub: Optional[float], 2025-05-07T20:33:22.9262093Z contiguous: bool, 2025-05-07T20:33:22.9262335Z compiled: bool, 2025-05-07T20:33:22.9262569Z ) -> None: 2025-05-07T20:33:22.9262780Z torch.manual_seed(2025) 2025-05-07T20:33:22.9263026Z 2025-05-07T20:33:22.9263301Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.9263655Z 2025-05-07T20:33:22.9263842Z x_sign = torch.sign(x) 2025-05-07T20:33:22.9264136Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.9264451Z x = x_sign * x_clamp 2025-05-07T20:33:22.9264689Z x0 = x[:, :D] 2025-05-07T20:33:22.9264906Z x1 = x[:, D:] 2025-05-07T20:33:22.9265119Z 2025-05-07T20:33:22.9265297Z if contiguous: 2025-05-07T20:33:22.9265531Z x0 = x0.contiguous() 2025-05-07T20:33:22.9265798Z x1 = x1.contiguous() 2025-05-07T20:33:22.9266036Z 2025-05-07T20:33:22.9266229Z if scale_ub is not None: 2025-05-07T20:33:22.9266509Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.9266848Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.9267169Z ) 2025-05-07T20:33:22.9267364Z else: 2025-05-07T20:33:22.9267574Z scale_ub_tensor = None 2025-05-07T20:33:22.9267838Z 2025-05-07T20:33:22.9268076Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.9268401Z op = silu_mul_quant 2025-05-07T20:33:22.9268650Z if compiled: 2025-05-07T20:33:22.9268902Z op = torch.compile(op) 2025-05-07T20:33:22.9269210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.9269490Z 2025-05-07T20:33:22.9269684Z y_fp8, y_scale = fn() 2025-05-07T20:33:22.9270235Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:22.9270533Z 2025-05-07T20:33:22.9270771Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.9271122Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:22.9271416Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:22.9271739Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:22.9272112Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.9272432Z 2025-05-07T20:33:22.9272628Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:22.9272835Z 2025-05-07T20:33:22.9272931Z moe/activation_test.py:126: 2025-05-07T20:33:22.9273238Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.9273585Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:22.9273924Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.9274779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:22.9275588Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.9276166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.9276954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.9277732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:22.9278500Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.9279316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:22.9280187Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.9280973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:22.9281655Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.9282302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:22.9283118Z fn() 2025-05-07T20:33:22.9283656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:22.9284283Z self.fn.run( 2025-05-07T20:33:22.9284771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.9285335Z kernel = self.compile( 2025-05-07T20:33:22.9285903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.9286606Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.9287023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.9287266Z 2025-05-07T20:33:22.9287485Z self = 2025-05-07T20:33:22.9288650Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.9290169Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58ff752160>} 2025-05-07T20:33:22.9291638Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.9292756Z context = 2025-05-07T20:33:22.9293061Z 2025-05-07T20:33:22.9293305Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.9293859Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.9294353Z module_map=module_map) 2025-05-07T20:33:22.9294734Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.9295099Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:22.9295379Z E ^ 2025-05-07T20:33:22.9295871Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.9296360Z 2025-05-07T20:33:22.9296808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.9297374Z 2025-05-07T20:33:22.9297475Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.9297906Z self=, 2025-05-07T20:33:22.9298337Z T=128, 2025-05-07T20:33:22.9298523Z D=7168, 2025-05-07T20:33:22.9298715Z scale_ub=None, 2025-05-07T20:33:22.9298929Z contiguous=False, 2025-05-07T20:33:22.9299162Z compiled=False, 2025-05-07T20:33:22.9299373Z ) 2025-05-07T20:33:23.3273497Z self = 2025-05-07T20:33:23.3274135Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:23.3274545Z 2025-05-07T20:33:23.3274627Z @given( 2025-05-07T20:33:23.3274860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:23.3275187Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:23.3275503Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:23.3275932Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:23.3276271Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:23.3276564Z ) 2025-05-07T20:33:23.3276934Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:23.3277398Z def test_silu_mul_quant( 2025-05-07T20:33:23.3277638Z self, 2025-05-07T20:33:23.3277824Z T: int, 2025-05-07T20:33:23.3278019Z D: int, 2025-05-07T20:33:23.3278232Z scale_ub: Optional[float], 2025-05-07T20:33:23.3278501Z contiguous: bool, 2025-05-07T20:33:23.3278741Z compiled: bool, 2025-05-07T20:33:23.3278970Z ) -> None: 2025-05-07T20:33:23.3279178Z torch.manual_seed(2025) 2025-05-07T20:33:23.3279424Z 2025-05-07T20:33:23.3279697Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:23.3280049Z 2025-05-07T20:33:23.3280274Z x_sign = torch.sign(x) 2025-05-07T20:33:23.3280561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:23.3280888Z x = x_sign * x_clamp 2025-05-07T20:33:23.3281131Z x0 = x[:, :D] 2025-05-07T20:33:23.3281343Z x1 = x[:, D:] 2025-05-07T20:33:23.3281553Z 2025-05-07T20:33:23.3281736Z if contiguous: 2025-05-07T20:33:23.3281962Z x0 = x0.contiguous() 2025-05-07T20:33:23.3282222Z x1 = x1.contiguous() 2025-05-07T20:33:23.3282465Z 2025-05-07T20:33:23.3282655Z if scale_ub is not None: 2025-05-07T20:33:23.3283231Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:23.3283575Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:23.3283898Z ) 2025-05-07T20:33:23.3284079Z else: 2025-05-07T20:33:23.3284288Z scale_ub_tensor = None 2025-05-07T20:33:23.3284545Z 2025-05-07T20:33:23.3284769Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:23.3285101Z op = silu_mul_quant 2025-05-07T20:33:23.3285363Z if compiled: 
2025-05-07T20:33:23.3285605Z op = torch.compile(op) 2025-05-07T20:33:23.3285909Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.3286194Z 2025-05-07T20:33:23.3286471Z > y_fp8, y_scale = fn() 2025-05-07T20:33:23.3286650Z 2025-05-07T20:33:23.3286749Z moe/activation_test.py:117: 2025-05-07T20:33:23.3287052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.3287398Z moe/activation_test.py:115: in fn 2025-05-07T20:33:23.3287683Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.3288421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:23.3289165Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:23.3289726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:23.3290457Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:23.3291169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:23.3291738Z kernel = self.compile( 2025-05-07T20:33:23.3292303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:23.3293000Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:23.3293484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.3293729Z 2025-05-07T20:33:23.3294003Z self = 2025-05-07T20:33:23.3295167Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:23.3296820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58ff6dd940>} 2025-05-07T20:33:23.3298282Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:23.3299387Z context = 2025-05-07T20:33:23.3299691Z 2025-05-07T20:33:23.3299866Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:23.3300411Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:23.3300905Z module_map=module_map) 2025-05-07T20:33:23.3301283Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:23.3301637Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:23.3301903Z E ^ 2025-05-07T20:33:23.3302391Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:23.3302877Z 2025-05-07T20:33:23.3303330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:23.3303886Z 2025-05-07T20:33:23.3303988Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:23.3304419Z self=, 2025-05-07T20:33:23.3304840Z T=4096, 2025-05-07T20:33:23.3305026Z D=5120, 2025-05-07T20:33:23.3305222Z scale_ub=1200.0, 2025-05-07T20:33:23.3305447Z contiguous=True, 2025-05-07T20:33:23.3305666Z compiled=False, 2025-05-07T20:33:23.3305867Z ) 2025-05-07T20:33:23.3306191Z self = 2025-05-07T20:33:23.3306712Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:23.3307005Z 2025-05-07T20:33:23.3307081Z @given( 2025-05-07T20:33:23.3307311Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:23.3307684Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:23.3308025Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:23.3308398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:23.3308776Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:23.3309094Z ) 2025-05-07T20:33:23.3309498Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:23.3310131Z def test_silu_mul_quant( 2025-05-07T20:33:23.3310378Z self, 2025-05-07T20:33:23.3310568Z T: int, 2025-05-07T20:33:23.3310762Z D: int, 2025-05-07T20:33:23.3310984Z scale_ub: Optional[float], 2025-05-07T20:33:23.3311252Z contiguous: bool, 2025-05-07T20:33:23.3311491Z compiled: bool, 2025-05-07T20:33:23.3311720Z ) -> None: 2025-05-07T20:33:23.3311932Z torch.manual_seed(2025) 2025-05-07T20:33:23.3312185Z 2025-05-07T20:33:23.3312457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:23.3312810Z 2025-05-07T20:33:23.3313008Z x_sign = torch.sign(x) 2025-05-07T20:33:23.3313304Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:23.3313619Z x = x_sign * x_clamp 2025-05-07T20:33:23.3313861Z x0 = x[:, :D] 2025-05-07T20:33:23.3314077Z x1 = x[:, D:] 2025-05-07T20:33:23.3314327Z 2025-05-07T20:33:23.3314510Z if contiguous: 2025-05-07T20:33:23.3314779Z x0 = x0.contiguous() 2025-05-07T20:33:23.3315041Z x1 = x1.contiguous() 2025-05-07T20:33:23.3315282Z 2025-05-07T20:33:23.3315471Z if scale_ub is not None: 2025-05-07T20:33:23.3315747Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:23.3316081Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:23.3316437Z ) 2025-05-07T20:33:23.3316630Z else: 2025-05-07T20:33:23.3316831Z scale_ub_tensor = None 2025-05-07T20:33:23.3317087Z 2025-05-07T20:33:23.3317317Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:23.3317633Z op = silu_mul_quant 2025-05-07T20:33:23.3317889Z if compiled: 2025-05-07T20:33:23.3318138Z op = torch.compile(op) 2025-05-07T20:33:23.3318431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.3318713Z 2025-05-07T20:33:23.3318908Z > y_fp8, y_scale = fn() 2025-05-07T20:33:23.3319075Z 2025-05-07T20:33:23.3319180Z moe/activation_test.py:117: 2025-05-07T20:33:23.3319474Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.3319820Z moe/activation_test.py:115: in fn 2025-05-07T20:33:23.3320106Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.3320831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:23.3321579Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:23.3322150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:23.3322881Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:23.3323590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:23.3324162Z kernel = self.compile( 2025-05-07T20:33:23.3324738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:23.3325436Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:23.3325850Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.3326097Z 2025-05-07T20:33:23.3326308Z self = 2025-05-07T20:33:23.3327527Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:23.3329025Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58ff73a8b0>} 2025-05-07T20:33:23.3330492Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:23.3331601Z context = 2025-05-07T20:33:23.3331905Z 2025-05-07T20:33:23.3332081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:23.3332633Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:23.3333120Z module_map=module_map) 2025-05-07T20:33:23.3333503Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:23.3333868Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:23.3334128Z E ^ 2025-05-07T20:33:23.3334617Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:23.3335102Z 2025-05-07T20:33:23.3335599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:23.3336189Z 2025-05-07T20:33:23.3336294Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:23.3336718Z self=, 2025-05-07T20:33:23.3337144Z T=1, 2025-05-07T20:33:23.3337322Z D=5120, 2025-05-07T20:33:23.3337581Z scale_ub=None, 2025-05-07T20:33:23.3337793Z contiguous=True, 2025-05-07T20:33:23.3338013Z compiled=True, 2025-05-07T20:33:23.3338208Z ) 2025-05-07T20:33:23.9874448Z self = 2025-05-07T20:33:23.9875167Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:23.9875445Z 2025-05-07T20:33:23.9875538Z @given( 2025-05-07T20:33:23.9875769Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:23.9876101Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:23.9876428Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:23.9876774Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:23.9877117Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:23.9877416Z ) 2025-05-07T20:33:23.9877782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:23.9878247Z def test_silu_mul_quant( 2025-05-07T20:33:23.9878499Z self, 2025-05-07T20:33:23.9878693Z T: int, 2025-05-07T20:33:23.9878884Z D: int, 2025-05-07T20:33:23.9885519Z scale_ub: Optional[float], 2025-05-07T20:33:23.9885863Z contiguous: bool, 2025-05-07T20:33:23.9886127Z compiled: bool, 2025-05-07T20:33:23.9886367Z ) -> None: 2025-05-07T20:33:23.9886600Z torch.manual_seed(2025) 2025-05-07T20:33:23.9886898Z 2025-05-07T20:33:23.9887186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:23.9887554Z 2025-05-07T20:33:23.9887762Z x_sign = torch.sign(x) 2025-05-07T20:33:23.9888059Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:23.9888387Z x = x_sign * x_clamp 2025-05-07T20:33:23.9888640Z x0 = x[:, :D] 2025-05-07T20:33:23.9888856Z x1 = x[:, D:] 2025-05-07T20:33:23.9889077Z 2025-05-07T20:33:23.9889276Z if contiguous: 2025-05-07T20:33:23.9889508Z x0 = x0.contiguous() 2025-05-07T20:33:23.9889785Z x1 = x1.contiguous() 2025-05-07T20:33:23.9890040Z 2025-05-07T20:33:23.9890232Z if scale_ub is not None: 2025-05-07T20:33:23.9890517Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:23.9891175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:23.9891515Z ) 2025-05-07T20:33:23.9891714Z else: 2025-05-07T20:33:23.9891934Z scale_ub_tensor = None 2025-05-07T20:33:23.9892200Z 2025-05-07T20:33:23.9892437Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:23.9892778Z op = silu_mul_quant 2025-05-07T20:33:23.9893051Z if compiled: 2025-05-07T20:33:23.9893299Z op = torch.compile(op) 2025-05-07T20:33:23.9893611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.9893901Z 2025-05-07T20:33:23.9894095Z y_fp8, y_scale = fn() 2025-05-07T20:33:23.9894392Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:23.9894703Z 2025-05-07T20:33:23.9894947Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:23.9895296Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:23.9895607Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:23.9895930Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:23.9896308Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:23.9896635Z 2025-05-07T20:33:23.9896834Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:23.9897130Z 2025-05-07T20:33:23.9897239Z moe/activation_test.py:126: 2025-05-07T20:33:23.9897619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.9897974Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:23.9898309Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:23.9899153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:23.9900048Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:23.9900629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:23.9901361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:23.9902090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:23.9902867Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:23.9903675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:23.9904477Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:23.9905253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:23.9905936Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:23.9906577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:23.9907124Z fn() 2025-05-07T20:33:23.9907658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:23.9908279Z self.fn.run( 2025-05-07T20:33:23.9908771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:23.9909334Z kernel = self.compile( 2025-05-07T20:33:23.9910070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:23.9910767Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:23.9911175Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.9911426Z 2025-05-07T20:33:23.9911638Z self = 2025-05-07T20:33:23.9912860Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:23.9914381Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58ff44e550>} 2025-05-07T20:33:23.9915850Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:23.9916990Z context = 2025-05-07T20:33:23.9917318Z 2025-05-07T20:33:23.9917487Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:23.9918034Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:23.9918528Z module_map=module_map) 2025-05-07T20:33:23.9918897Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:23.9919261Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:23.9919535Z E ^ 2025-05-07T20:33:23.9920065Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:23.9920603Z 2025-05-07T20:33:23.9921051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:23.9921614Z 2025-05-07T20:33:23.9921716Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:23.9922146Z self=, 2025-05-07T20:33:23.9922601Z T=2048, 2025-05-07T20:33:23.9922785Z D=5120, 2025-05-07T20:33:23.9922976Z scale_ub=None, 2025-05-07T20:33:23.9923186Z contiguous=True, 2025-05-07T20:33:23.9923412Z compiled=True, 2025-05-07T20:33:23.9923624Z ) 2025-05-07T20:33:24.6038460Z self = 2025-05-07T20:33:24.6039286Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:24.6039692Z 2025-05-07T20:33:24.6039807Z @given( 2025-05-07T20:33:24.6040155Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.6040574Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.6040992Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.6041430Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.6041799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.6042106Z ) 2025-05-07T20:33:24.6042479Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.6042955Z def test_silu_mul_quant( 2025-05-07T20:33:24.6043204Z self, 2025-05-07T20:33:24.6043410Z T: int, 2025-05-07T20:33:24.6043608Z D: int, 2025-05-07T20:33:24.6043838Z scale_ub: Optional[float], 2025-05-07T20:33:24.6044118Z contiguous: bool, 2025-05-07T20:33:24.6044361Z compiled: bool, 2025-05-07T20:33:24.6044602Z ) -> None: 2025-05-07T20:33:24.6044830Z torch.manual_seed(2025) 2025-05-07T20:33:24.6045084Z 2025-05-07T20:33:24.6045366Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.6045741Z 2025-05-07T20:33:24.6045946Z x_sign = torch.sign(x) 2025-05-07T20:33:24.6046244Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.6046578Z x = x_sign * x_clamp 2025-05-07T20:33:24.6046833Z x0 = x[:, :D] 2025-05-07T20:33:24.6047053Z x1 = x[:, D:] 2025-05-07T20:33:24.6047284Z 2025-05-07T20:33:24.6047481Z if contiguous: 2025-05-07T20:33:24.6047713Z x0 = x0.contiguous() 2025-05-07T20:33:24.6047987Z x1 = x1.contiguous() 2025-05-07T20:33:24.6048240Z 2025-05-07T20:33:24.6048721Z if scale_ub is not None: 2025-05-07T20:33:24.6049013Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.6049364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.6049688Z ) 2025-05-07T20:33:24.6049890Z else: 2025-05-07T20:33:24.6050107Z scale_ub_tensor = None 2025-05-07T20:33:24.6050380Z 2025-05-07T20:33:24.6050615Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.6050950Z op = silu_mul_quant 2025-05-07T20:33:24.6051212Z if compiled: 
2025-05-07T20:33:24.6051464Z op = torch.compile(op) 2025-05-07T20:33:24.6051776Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.6052067Z 2025-05-07T20:33:24.6052266Z y_fp8, y_scale = fn() 2025-05-07T20:33:24.6052560Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:24.6052869Z 2025-05-07T20:33:24.6053107Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.6053465Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:24.6053779Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:24.6054104Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:24.6054485Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:24.6054900Z 2025-05-07T20:33:24.6055119Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:24.6055430Z 2025-05-07T20:33:24.6055531Z moe/activation_test.py:126: 2025-05-07T20:33:24.6055837Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.6056187Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:24.6056520Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:24.6057465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:24.6058297Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:24.6058883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.6059620Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.6060370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:24.6061167Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:24.6061985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:24.6062795Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:24.6063591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:24.6064285Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:24.6064925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:24.6065488Z fn() 2025-05-07T20:33:24.6066031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:24.6066665Z self.fn.run( 2025-05-07T20:33:24.6067207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.6067786Z kernel = self.compile( 2025-05-07T20:33:24.6068365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.6069071Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.6069495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.6069927Z 2025-05-07T20:33:24.6070198Z self = 2025-05-07T20:33:24.6071397Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:33:24.6072958Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fefa7f70>} 2025-05-07T20:33:24.6074450Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.6075573Z context = 2025-05-07T20:33:24.6075888Z 2025-05-07T20:33:24.6076058Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.6076620Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.6077110Z module_map=module_map) 2025-05-07T20:33:24.6077495Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.6077868Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:24.6078226Z E ^ 2025-05-07T20:33:24.6078724Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.6079269Z 2025-05-07T20:33:24.6079725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.6080291Z 2025-05-07T20:33:24.6080401Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.6080869Z self=, 2025-05-07T20:33:24.6081296Z T=128, 2025-05-07T20:33:24.6081487Z D=5120, 2025-05-07T20:33:24.6081676Z scale_ub=None, 2025-05-07T20:33:24.6081893Z contiguous=True, 2025-05-07T20:33:24.6082120Z compiled=True, 2025-05-07T20:33:24.6082320Z ) 2025-05-07T20:33:25.5951103Z self = 2025-05-07T20:33:25.5951880Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:25.5952298Z 2025-05-07T20:33:25.5952421Z @given( 2025-05-07T20:33:25.5952741Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.5953082Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.5953413Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.5953762Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.5954118Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.5954436Z ) 2025-05-07T20:33:25.5954820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.5955307Z def test_silu_mul_quant( 2025-05-07T20:33:25.5955577Z self, 2025-05-07T20:33:25.5955784Z T: int, 2025-05-07T20:33:25.5955988Z D: int, 2025-05-07T20:33:25.5956222Z scale_ub: Optional[float], 2025-05-07T20:33:25.5956514Z contiguous: bool, 2025-05-07T20:33:25.5956764Z compiled: bool, 2025-05-07T20:33:25.5957009Z ) -> None: 2025-05-07T20:33:25.5957263Z torch.manual_seed(2025) 2025-05-07T20:33:25.5957538Z 2025-05-07T20:33:25.5957825Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.5958197Z 2025-05-07T20:33:25.5958423Z x_sign = torch.sign(x) 2025-05-07T20:33:25.5958724Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.5959056Z x = x_sign * x_clamp 2025-05-07T20:33:25.5959311Z x0 = x[:, :D] 2025-05-07T20:33:25.5959531Z x1 = x[:, D:] 2025-05-07T20:33:25.5959759Z 2025-05-07T20:33:25.5959955Z if contiguous: 2025-05-07T20:33:25.5960190Z x0 = x0.contiguous() 2025-05-07T20:33:25.5960794Z x1 = x1.contiguous() 2025-05-07T20:33:25.5961051Z 2025-05-07T20:33:25.5961255Z if scale_ub is not None: 2025-05-07T20:33:25.5961537Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.5961889Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.5962209Z ) 2025-05-07T20:33:25.5962397Z else: 2025-05-07T20:33:25.5962606Z scale_ub_tensor = None 2025-05-07T20:33:25.5962872Z 2025-05-07T20:33:25.5963096Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:33:25.5963424Z op = silu_mul_quant 2025-05-07T20:33:25.5963676Z if compiled: 2025-05-07T20:33:25.5963920Z op = torch.compile(op) 2025-05-07T20:33:25.5964221Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.5964511Z 2025-05-07T20:33:25.5964696Z y_fp8, y_scale = fn() 2025-05-07T20:33:25.5964983Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:25.5965287Z 2025-05-07T20:33:25.5965520Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.5965868Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:25.5966180Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:25.5966507Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:25.5966971Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.5967399Z 2025-05-07T20:33:25.5967609Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:25.5967812Z 2025-05-07T20:33:25.5967911Z moe/activation_test.py:126: 2025-05-07T20:33:25.5968214Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.5968562Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:25.5968983Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.5969838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:25.5970659Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:25.5971240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.5971971Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.5972711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:25.5973493Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:25.5974303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:25.5975108Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:25.5975895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:25.5976581Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:25.5977225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:25.5977776Z fn() 2025-05-07T20:33:25.5978317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:25.5978942Z self.fn.run( 2025-05-07T20:33:25.5979428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.5979992Z kernel = self.compile( 2025-05-07T20:33:25.5980570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.5981274Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.5981687Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.5981991Z 2025-05-07T20:33:25.5982205Z self = 2025-05-07T20:33:25.5983716Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.5985238Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58ff2d2b80>} 2025-05-07T20:33:25.5986704Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.5987816Z context = 2025-05-07T20:33:25.5988126Z 2025-05-07T20:33:25.5988297Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.5988843Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.5989326Z module_map=module_map) 2025-05-07T20:33:25.5989706Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.5990245Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:25.5990597Z E ^ 2025-05-07T20:33:25.5991146Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.5991712Z 2025-05-07T20:33:25.5992222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.5992919Z 2025-05-07T20:33:25.5993033Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.5993502Z self=, 2025-05-07T20:33:25.5993973Z T=4096, 2025-05-07T20:33:25.5994170Z D=5120, 2025-05-07T20:33:25.5994366Z scale_ub=None, 2025-05-07T20:33:25.5994594Z contiguous=True, 2025-05-07T20:33:25.5994832Z compiled=True, 2025-05-07T20:33:25.5995049Z ) 2025-05-07T20:33:26.4281572Z self = 2025-05-07T20:33:26.4282357Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:26.4282700Z 2025-05-07T20:33:26.4283021Z @given( 2025-05-07T20:33:26.4283267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.4283586Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.4283902Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.4284245Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.4284583Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.4284881Z ) 2025-05-07T20:33:26.4285253Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.4285711Z def test_silu_mul_quant( 2025-05-07T20:33:26.4285955Z self, 2025-05-07T20:33:26.4286152Z T: int, 2025-05-07T20:33:26.4286347Z D: int, 2025-05-07T20:33:26.4286573Z scale_ub: Optional[float], 2025-05-07T20:33:26.4286847Z contiguous: bool, 2025-05-07T20:33:26.4287102Z compiled: bool, 2025-05-07T20:33:26.4287326Z ) -> None: 2025-05-07T20:33:26.4287547Z torch.manual_seed(2025) 2025-05-07T20:33:26.4287796Z 2025-05-07T20:33:26.4288066Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.4288428Z 2025-05-07T20:33:26.4288625Z x_sign = torch.sign(x) 2025-05-07T20:33:26.4288914Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.4289238Z x = x_sign * x_clamp 2025-05-07T20:33:26.4289482Z x0 = x[:, :D] 2025-05-07T20:33:26.4289695Z x1 = x[:, D:] 2025-05-07T20:33:26.4289903Z 2025-05-07T20:33:26.4290416Z if contiguous: 2025-05-07T20:33:26.4290646Z x0 = x0.contiguous() 2025-05-07T20:33:26.4290905Z x1 = x1.contiguous() 2025-05-07T20:33:26.4291150Z 2025-05-07T20:33:26.4291335Z if scale_ub is not None: 2025-05-07T20:33:26.4291611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.4291952Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.4292268Z ) 2025-05-07T20:33:26.4292453Z else: 2025-05-07T20:33:26.4292660Z scale_ub_tensor 
= None 2025-05-07T20:33:26.4292917Z 2025-05-07T20:33:26.4293144Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.4293468Z op = silu_mul_quant 2025-05-07T20:33:26.4293721Z if compiled: 2025-05-07T20:33:26.4293962Z op = torch.compile(op) 2025-05-07T20:33:26.4294264Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.4294548Z 2025-05-07T20:33:26.4294733Z y_fp8, y_scale = fn() 2025-05-07T20:33:26.4295026Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:26.4295325Z 2025-05-07T20:33:26.4295556Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.4295899Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:26.4296295Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:26.4296614Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:26.4297076Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:26.4297446Z 2025-05-07T20:33:26.4297651Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:26.4297850Z 2025-05-07T20:33:26.4297946Z moe/activation_test.py:126: 2025-05-07T20:33:26.4298245Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.4298668Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:26.4298997Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:26.4299849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:26.4300665Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:26.4301251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.4301976Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.4302718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:26.4303494Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:26.4304302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:26.4305101Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:26.4305885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:26.4306568Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:26.4307201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:26.4307757Z fn() 2025-05-07T20:33:26.4308294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:26.4308918Z self.fn.run( 2025-05-07T20:33:26.4309401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.4310121Z kernel = self.compile( 2025-05-07T20:33:26.4310697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.4311387Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.4312376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.4312631Z 2025-05-07T20:33:26.4312847Z self = 2025-05-07T20:33:26.4314016Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.4315549Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58feddf5e0>} 2025-05-07T20:33:26.4317010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.4318139Z context = 2025-05-07T20:33:26.4324448Z 2025-05-07T20:33:26.4324641Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.4325214Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.4325803Z module_map=module_map) 2025-05-07T20:33:26.4326185Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.4326609Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:26.4326895Z E ^ 2025-05-07T20:33:26.4327386Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.4327890Z 2025-05-07T20:33:26.4328343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.4328958Z 2025-05-07T20:33:26.4329061Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.4329493Z self=, 2025-05-07T20:33:26.4329910Z T=16384, 2025-05-07T20:33:26.4330109Z D=5120, 2025-05-07T20:33:26.4330311Z scale_ub=None, 2025-05-07T20:33:26.4330521Z contiguous=True, 2025-05-07T20:33:26.4330752Z compiled=True, 2025-05-07T20:33:26.4330970Z ) 2025-05-07T20:33:26.4752350Z W0507 20:33:26.473680 88490 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:33:26.4753753Z W0507 20:33:26.473680 88490 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:33:26.4755245Z W0507 20:33:26.473680 88490 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:33:26.4756341Z W0507 20:33:26.473680 88490 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:33:26.4757557Z W0507 20:33:26.473680 88490 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
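[Editor's note, not part of the captured log] The W0507 records above show torch._dynamo giving up on 'silu_mul_quant' after hitting config.recompile_limit (8): each Hypothesis example changes T, and the contiguous flag changes strides, so every new shape/stride combination triggers a fresh compile until the limit is reached and dynamo falls back to eager. A minimal sketch of the two mitigations the warning itself points at; the recompile_limit knob name is taken from the warning, and mark_dynamic / dynamic=True are standard torch.compile APIs, but wiring them into this test is the editor's assumption, not something the log shows:

    import torch
    import torch._dynamo

    # Option 1: raise the cap so all sampled shapes fit (the default is 8,
    # per the warning above).
    torch._dynamo.config.recompile_limit = 64

    # Option 2: compile one shape-generic graph instead of one per (T, stride).
    op = torch.compile(silu_mul_quant, dynamic=True)
    torch._dynamo.mark_dynamic(x0, 0)  # token dimension T varies per example
    torch._dynamo.mark_dynamic(x1, 0)
    y_fp8, y_scale = op(x0, x1, scale_ub_tensor)

As the warning notes, rerunning with TORCH_LOGS="recompiles" prints every failed guard; the one shown here is the stride mismatch on x0 between the contiguous and sliced input variants.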
2025-05-07T20:33:26.5957710Z self = 2025-05-07T20:33:26.5958520Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:26.5958931Z 2025-05-07T20:33:26.5959045Z @given( 2025-05-07T20:33:26.5959305Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.5959637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.5959959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.5960309Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.5960655Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.5961242Z ) 2025-05-07T20:33:26.5961613Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.5962091Z def test_silu_mul_quant( 2025-05-07T20:33:26.5962339Z self, 2025-05-07T20:33:26.5962543Z T: int, 2025-05-07T20:33:26.5962745Z D: int, 2025-05-07T20:33:26.5962970Z scale_ub: Optional[float], 2025-05-07T20:33:26.5963261Z contiguous: bool, 2025-05-07T20:33:26.5963513Z compiled: bool, 2025-05-07T20:33:26.5963750Z ) -> None: 2025-05-07T20:33:26.5963965Z torch.manual_seed(2025) 2025-05-07T20:33:26.5964223Z 2025-05-07T20:33:26.5964507Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.5964863Z 2025-05-07T20:33:26.5965060Z x_sign = torch.sign(x) 2025-05-07T20:33:26.5965358Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.5965680Z x = x_sign * x_clamp 2025-05-07T20:33:26.5965923Z x0 = x[:, :D] 2025-05-07T20:33:26.5966149Z x1 = x[:, D:] 2025-05-07T20:33:26.5966363Z 2025-05-07T20:33:26.5966547Z if contiguous: 2025-05-07T20:33:26.5966778Z x0 = x0.contiguous() 2025-05-07T20:33:26.5967041Z x1 = x1.contiguous() 2025-05-07T20:33:26.5967285Z 2025-05-07T20:33:26.5967477Z if scale_ub is not None: 2025-05-07T20:33:26.5967847Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.5968262Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.5968583Z ) 2025-05-07T20:33:26.5968775Z else: 2025-05-07T20:33:26.5968980Z scale_ub_tensor = None 2025-05-07T20:33:26.5969237Z 2025-05-07T20:33:26.5969474Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.5969884Z op = silu_mul_quant 2025-05-07T20:33:26.5970133Z if compiled: 2025-05-07T20:33:26.5970384Z op = torch.compile(op) 2025-05-07T20:33:26.5970692Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.5970976Z 2025-05-07T20:33:26.5971168Z y_fp8, y_scale = fn() 2025-05-07T20:33:26.5971456Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:26.5971752Z 2025-05-07T20:33:26.5971994Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.5972343Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:26.5972640Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:26.5972966Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:26.5973339Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:26.5973660Z 2025-05-07T20:33:26.5973866Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:26.5974077Z 2025-05-07T20:33:26.5974182Z moe/activation_test.py:126: 2025-05-07T20:33:26.5974490Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.5974830Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:26.5975170Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:26.5976025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:26.5976840Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:26.5977424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.5978164Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.5978908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:26.5979674Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:26.5980486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:26.5981341Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:26.5982128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:26.5983097Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:26.5983741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:26.5984294Z fn() 2025-05-07T20:33:26.5984823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:26.5985445Z self.fn.run( 2025-05-07T20:33:26.5985933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.5986499Z kernel = self.compile( 2025-05-07T20:33:26.5987064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.5987814Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.5988229Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.5988470Z 2025-05-07T20:33:26.5988694Z self = 2025-05-07T20:33:26.5990112Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.5991698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58ff41bee0>} 2025-05-07T20:33:26.5993227Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.5994334Z context = 2025-05-07T20:33:26.5994639Z 2025-05-07T20:33:26.5994809Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.5995362Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.5995860Z module_map=module_map) 2025-05-07T20:33:26.5996245Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.5996608Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:26.5996883Z E ^ 2025-05-07T20:33:26.5997377Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.5997870Z 2025-05-07T20:33:26.5998319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.5998882Z 2025-05-07T20:33:26.5998989Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.5999419Z self=, 2025-05-07T20:33:26.5999844Z T=1, 2025-05-07T20:33:26.6000025Z D=5120, 2025-05-07T20:33:26.6000230Z scale_ub=1200.0, 2025-05-07T20:33:26.6000457Z contiguous=True, 2025-05-07T20:33:26.6000676Z compiled=True, 2025-05-07T20:33:26.6000890Z ) 2025-05-07T20:33:26.7700997Z self = 2025-05-07T20:33:26.7701766Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:26.7702139Z 2025-05-07T20:33:26.7702244Z @given( 2025-05-07T20:33:26.7702535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.7702878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.7703193Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.7703524Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.7704044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.7704341Z ) 2025-05-07T20:33:26.7704696Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.7705163Z def test_silu_mul_quant( 2025-05-07T20:33:26.7705412Z self, 2025-05-07T20:33:26.7705607Z T: int, 2025-05-07T20:33:26.7705808Z D: int, 2025-05-07T20:33:26.7706036Z scale_ub: Optional[float], 2025-05-07T20:33:26.7706305Z contiguous: bool, 2025-05-07T20:33:26.7706548Z compiled: bool, 2025-05-07T20:33:26.7706773Z ) -> None: 2025-05-07T20:33:26.7706980Z torch.manual_seed(2025) 2025-05-07T20:33:26.7707230Z 2025-05-07T20:33:26.7707501Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.7707857Z 2025-05-07T20:33:26.7708041Z x_sign = torch.sign(x) 2025-05-07T20:33:26.7708332Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.7708649Z x = x_sign * x_clamp 2025-05-07T20:33:26.7708893Z x0 = x[:, :D] 2025-05-07T20:33:26.7709110Z x1 = x[:, D:] 2025-05-07T20:33:26.7709320Z 2025-05-07T20:33:26.7709498Z if contiguous: 2025-05-07T20:33:26.7709732Z x0 = x0.contiguous() 2025-05-07T20:33:26.7710310Z x1 = x1.contiguous() 2025-05-07T20:33:26.7710550Z 2025-05-07T20:33:26.7710737Z if scale_ub is not None: 2025-05-07T20:33:26.7711092Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.7711431Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.7711749Z ) 2025-05-07T20:33:26.7711936Z else: 2025-05-07T20:33:26.7712142Z scale_ub_tensor = None 2025-05-07T20:33:26.7712508Z 2025-05-07T20:33:26.7712738Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.7713062Z op = silu_mul_quant 2025-05-07T20:33:26.7713309Z if compiled: 2025-05-07T20:33:26.7713560Z op = torch.compile(op) 2025-05-07T20:33:26.7713862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.7714141Z 2025-05-07T20:33:26.7714333Z > y_fp8, y_scale = fn() 2025-05-07T20:33:26.7714499Z 2025-05-07T20:33:26.7714604Z moe/activation_test.py:117: 2025-05-07T20:33:26.7714906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.7715254Z moe/activation_test.py:115: in fn 2025-05-07T20:33:26.7715538Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.7716119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:26.7716717Z return fn(*args, **kwargs) 
2025-05-07T20:33:26.7717423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:26.7718171Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:26.7718733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.7719461Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.7720167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.7720738Z kernel = self.compile( 2025-05-07T20:33:26.7721308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.7722009Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.7722422Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.7722665Z 2025-05-07T20:33:26.7722880Z self = 2025-05-07T20:33:26.7724104Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.7725619Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fedaf700>} 2025-05-07T20:33:26.7727093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.7728215Z context = 2025-05-07T20:33:26.7728523Z 2025-05-07T20:33:26.7728693Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.7729244Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.7729738Z module_map=module_map) 2025-05-07T20:33:26.7730109Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.7730473Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:26.7730747Z E ^ 2025-05-07T20:33:26.7731230Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.7731768Z 2025-05-07T20:33:26.7732219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.7732820Z 2025-05-07T20:33:26.7732921Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.7733350Z self=, 2025-05-07T20:33:26.7733774Z T=1, 2025-05-07T20:33:26.7733957Z D=5120, 2025-05-07T20:33:26.7734193Z scale_ub=None, 2025-05-07T20:33:26.7734403Z contiguous=False, 2025-05-07T20:33:26.7734630Z compiled=True, 2025-05-07T20:33:26.7734837Z ) 2025-05-07T20:33:26.8542347Z self = 2025-05-07T20:33:26.8543062Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:26.8543450Z 2025-05-07T20:33:26.8543558Z @given( 2025-05-07T20:33:26.8543870Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.8544294Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.8544699Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.8545144Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.8545575Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.8545870Z ) 2025-05-07T20:33:26.8546238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.8546699Z def test_silu_mul_quant( 2025-05-07T20:33:26.8546941Z self, 2025-05-07T20:33:26.8547132Z T: int, 2025-05-07T20:33:26.8547324Z D: int, 2025-05-07T20:33:26.8547532Z scale_ub: Optional[float], 2025-05-07T20:33:26.8547808Z contiguous: bool, 2025-05-07T20:33:26.8548048Z compiled: bool, 2025-05-07T20:33:26.8548268Z ) -> None: 2025-05-07T20:33:26.8548479Z torch.manual_seed(2025) 2025-05-07T20:33:26.8548720Z 2025-05-07T20:33:26.8548984Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.8549343Z 2025-05-07T20:33:26.8549536Z x_sign = torch.sign(x) 2025-05-07T20:33:26.8549970Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.8550290Z x = x_sign * x_clamp 2025-05-07T20:33:26.8550528Z x0 = x[:, :D] 2025-05-07T20:33:26.8550740Z x1 = x[:, D:] 2025-05-07T20:33:26.8550946Z 2025-05-07T20:33:26.8551126Z if contiguous: 2025-05-07T20:33:26.8551361Z x0 = x0.contiguous() 2025-05-07T20:33:26.8551616Z x1 = x1.contiguous() 2025-05-07T20:33:26.8551869Z 2025-05-07T20:33:26.8552058Z if scale_ub is not None: 2025-05-07T20:33:26.8552525Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.8552869Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.8553211Z ) 2025-05-07T20:33:26.8553402Z else: 2025-05-07T20:33:26.8553603Z scale_ub_tensor = None 2025-05-07T20:33:26.8553861Z 2025-05-07T20:33:26.8554100Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.8554415Z op = silu_mul_quant 2025-05-07T20:33:26.8554669Z if compiled: 2025-05-07T20:33:26.8554917Z op = torch.compile(op) 2025-05-07T20:33:26.8555214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.8555499Z 2025-05-07T20:33:26.8555695Z y_fp8, y_scale = fn() 2025-05-07T20:33:26.8555980Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:26.8556281Z 2025-05-07T20:33:26.8556517Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.8556861Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:26.8557157Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:26.8557478Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:26.8557848Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:26.8558163Z 2025-05-07T20:33:26.8558439Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:26.8558643Z 2025-05-07T20:33:26.8558745Z moe/activation_test.py:126: 2025-05-07T20:33:26.8559118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.8559464Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:26.8559801Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:26.8560649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:26.8561533Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:26.8562113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.8562842Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.8563572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:26.8564348Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:26.8565155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:26.8565960Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:26.8566736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:26.8567426Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:26.8568064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:26.8568613Z fn() 2025-05-07T20:33:26.8569137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:26.8569765Z self.fn.run( 2025-05-07T20:33:26.8570265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.8570827Z kernel = self.compile( 2025-05-07T20:33:26.8571403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.8572101Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.8572523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.8572772Z 2025-05-07T20:33:26.8572987Z self = 2025-05-07T20:33:26.8574212Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.8575728Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58fe6803a0>} 2025-05-07T20:33:26.8577194Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.8578349Z context = 2025-05-07T20:33:26.8578653Z 2025-05-07T20:33:26.8578823Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.8579368Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.8579865Z module_map=module_map) 2025-05-07T20:33:26.8580236Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.8580601Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:26.8580878Z E ^ 2025-05-07T20:33:26.8581417Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.8581940Z 2025-05-07T20:33:26.8582387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.8583399Z 2025-05-07T20:33:26.8583502Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.8583936Z self=, 2025-05-07T20:33:26.8584434Z T=1, 2025-05-07T20:33:26.8584617Z D=5120, 2025-05-07T20:33:26.8584810Z scale_ub=None, 2025-05-07T20:33:26.8585021Z contiguous=True, 2025-05-07T20:33:26.8585238Z compiled=False, 2025-05-07T20:33:26.8585445Z ) 2025-05-07T20:33:27.2160964Z self = 2025-05-07T20:33:27.2161725Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:27.2162101Z 2025-05-07T20:33:27.2162203Z @given( 2025-05-07T20:33:27.2162565Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:27.2162976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:27.2163368Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:27.2163722Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:27.2164055Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:27.2164351Z ) 2025-05-07T20:33:27.2164710Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:27.2165179Z def test_silu_mul_quant( 2025-05-07T20:33:27.2165424Z self, 2025-05-07T20:33:27.2165616Z T: int, 2025-05-07T20:33:27.2165814Z D: int, 2025-05-07T20:33:27.2166036Z scale_ub: Optional[float], 2025-05-07T20:33:27.2166311Z contiguous: bool, 2025-05-07T20:33:27.2166551Z compiled: bool, 2025-05-07T20:33:27.2166774Z ) -> None: 2025-05-07T20:33:27.2166995Z torch.manual_seed(2025) 2025-05-07T20:33:27.2167237Z 2025-05-07T20:33:27.2167506Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:27.2167866Z 2025-05-07T20:33:27.2168058Z x_sign = torch.sign(x) 2025-05-07T20:33:27.2168344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:27.2168666Z x = x_sign * x_clamp 2025-05-07T20:33:27.2168910Z x0 = x[:, :D] 2025-05-07T20:33:27.2169122Z x1 = x[:, D:] 2025-05-07T20:33:27.2169335Z 2025-05-07T20:33:27.2169522Z if contiguous: 2025-05-07T20:33:27.2169750Z x0 = x0.contiguous() 2025-05-07T20:33:27.2170013Z x1 = x1.contiguous() 2025-05-07T20:33:27.2170253Z 2025-05-07T20:33:27.2170736Z if scale_ub is not None: 2025-05-07T20:33:27.2171022Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:27.2171365Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:27.2171683Z ) 2025-05-07T20:33:27.2171865Z else: 2025-05-07T20:33:27.2172081Z scale_ub_tensor = None 2025-05-07T20:33:27.2172339Z 2025-05-07T20:33:27.2172566Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:27.2172889Z op = silu_mul_quant 2025-05-07T20:33:27.2173145Z if compiled: 2025-05-07T20:33:27.2173388Z op 
= torch.compile(op) 2025-05-07T20:33:27.2173690Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:27.2173976Z 2025-05-07T20:33:27.2174161Z > y_fp8, y_scale = fn() 2025-05-07T20:33:27.2174332Z 2025-05-07T20:33:27.2174429Z moe/activation_test.py:117: 2025-05-07T20:33:27.2174731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:27.2175070Z moe/activation_test.py:115: in fn 2025-05-07T20:33:27.2175353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:27.2176091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:27.2176928Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:27.2177492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:27.2178301Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:27.2179009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:27.2179657Z kernel = self.compile( 2025-05-07T20:33:27.2180223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:27.2180922Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:27.2181334Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:27.2181575Z 2025-05-07T20:33:27.2181786Z self = 2025-05-07T20:33:27.2183257Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:27.2191350Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fe643940>} 2025-05-07T20:33:27.2192863Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:27.2193988Z context = 2025-05-07T20:33:27.2194310Z 2025-05-07T20:33:27.2194486Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:27.2195058Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:27.2195571Z module_map=module_map) 2025-05-07T20:33:27.2195957Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.2196340Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.2196613Z E ^ 2025-05-07T20:33:27.2197109Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.2197648Z 2025-05-07T20:33:27.2198119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:27.2198688Z 2025-05-07T20:33:27.2198900Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.2199339Z self=, 2025-05-07T20:33:27.2199757Z T=128, 2025-05-07T20:33:27.2199951Z D=5120, 2025-05-07T20:33:27.2200152Z scale_ub=None, 2025-05-07T20:33:27.2200365Z contiguous=False, 2025-05-07T20:33:27.2200591Z compiled=True, 2025-05-07T20:33:27.2200804Z ) 2025-05-07T20:33:27.2201127Z self = 2025-05-07T20:33:27.2201650Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:27.2201938Z 2025-05-07T20:33:27.2202019Z @given( 2025-05-07T20:33:27.2202249Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:27.2202559Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:27.2202878Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:27.2203224Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:27.2203564Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:27.2203864Z ) 2025-05-07T20:33:27.2204236Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:27.2204701Z def test_silu_mul_quant( 2025-05-07T20:33:27.2204950Z self, 2025-05-07T20:33:27.2205150Z T: int, 2025-05-07T20:33:27.2205416Z D: int, 2025-05-07T20:33:27.2205645Z scale_ub: Optional[float], 2025-05-07T20:33:27.2205995Z contiguous: bool, 2025-05-07T20:33:27.2206242Z compiled: bool, 2025-05-07T20:33:27.2206462Z ) -> None: 2025-05-07T20:33:27.2206680Z torch.manual_seed(2025) 2025-05-07T20:33:27.2206929Z 2025-05-07T20:33:27.2207206Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:27.2207622Z 2025-05-07T20:33:27.2207824Z x_sign = torch.sign(x) 2025-05-07T20:33:27.2208119Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:27.2208429Z x = x_sign * x_clamp 2025-05-07T20:33:27.2208674Z x0 = x[:, :D] 2025-05-07T20:33:27.2208890Z x1 = x[:, D:] 2025-05-07T20:33:27.2209093Z 2025-05-07T20:33:27.2209277Z if contiguous: 2025-05-07T20:33:27.2209507Z x0 = x0.contiguous() 2025-05-07T20:33:27.2209763Z x1 = x1.contiguous() 2025-05-07T20:33:27.2210011Z 2025-05-07T20:33:27.2210211Z if scale_ub is not None: 2025-05-07T20:33:27.2210484Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:27.2210825Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:27.2211146Z ) 2025-05-07T20:33:27.2211328Z else: 2025-05-07T20:33:27.2211543Z scale_ub_tensor = None 2025-05-07T20:33:27.2211800Z 2025-05-07T20:33:27.2212031Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:27.2212353Z op = silu_mul_quant 2025-05-07T20:33:27.2212606Z if compiled: 2025-05-07T20:33:27.2212857Z op = torch.compile(op) 2025-05-07T20:33:27.2213155Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:27.2213436Z 2025-05-07T20:33:27.2213626Z > y_fp8, y_scale = fn() 2025-05-07T20:33:27.2213790Z 2025-05-07T20:33:27.2213885Z moe/activation_test.py:117: 2025-05-07T20:33:27.2214192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:27.2214535Z moe/activation_test.py:115: in fn 2025-05-07T20:33:27.2214815Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:27.2215405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:27.2216003Z return fn(*args, **kwargs) 
2025-05-07T20:33:27.2216711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:27.2217450Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:27.2218072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:27.2218805Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:27.2219512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:27.2220074Z kernel = self.compile( 2025-05-07T20:33:27.2220647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:27.2221347Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:27.2221756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:27.2222003Z 2025-05-07T20:33:27.2222214Z self = 2025-05-07T20:33:27.2223390Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:27.2224893Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fe4fc040>} 2025-05-07T20:33:27.2226418Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:27.2227593Z context = 2025-05-07T20:33:27.2227904Z 2025-05-07T20:33:27.2228074Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:27.2228623Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:27.2229738Z module_map=module_map) 2025-05-07T20:33:27.2230244Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.2230604Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.2230865Z E ^ 2025-05-07T20:33:27.2231349Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.2231842Z 2025-05-07T20:33:27.2232290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:27.2232853Z 2025-05-07T20:33:27.2232952Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.2233379Z self=, 2025-05-07T20:33:27.2233794Z T=128, 2025-05-07T20:33:27.2233981Z D=7168, 2025-05-07T20:33:27.2234171Z scale_ub=1200.0, 2025-05-07T20:33:27.2234392Z contiguous=False, 2025-05-07T20:33:27.2234620Z compiled=False, 2025-05-07T20:33:27.2234824Z ) 2025-05-07T20:33:27.3795251Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.3795614Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.3795883Z E ^ 2025-05-07T20:33:27.3796366Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.3796864Z 2025-05-07T20:33:27.3797312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
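Note on the failure mode above: every example aborts at Triton compile time, before the kernel ever runs, because fp8e4nv (the FP8 E4M3 format) cannot be lowered on this GPU. fp8e4nv needs an Ada- or Hopper-class part (CUDA compute capability 8.9 or newer), while the A10G in a linux.g5.4xlarge runner reports capability 8.6, which is why the ValueError lists only fp8e4b15 and fp8e5 as supported. A minimal pre-flight check along these lines would confirm this up front (a sketch; supports_fp8e4nv is a hypothetical helper, not part of this repo):

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton only lowers fp8e4nv (E4M3) on NVIDIA GPUs with compute
        # capability >= 8.9 (Ada/Hopper); the A10G on this runner is (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)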
2025-05-07T20:33:27.3797869Z 2025-05-07T20:33:27.3797983Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.3798401Z self=, 2025-05-07T20:33:27.3798825Z T=128, 2025-05-07T20:33:27.3799011Z D=5120, 2025-05-07T20:33:27.3799202Z scale_ub=None, 2025-05-07T20:33:27.3799412Z contiguous=False, 2025-05-07T20:33:27.3799641Z compiled=False, 2025-05-07T20:33:27.3799851Z ) 2025-05-07T20:33:27.3827389Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.3827754Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.3828019Z E ^ 2025-05-07T20:33:27.3828513Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.3829012Z 2025-05-07T20:33:27.3829462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:27.3830157Z 2025-05-07T20:33:27.3830266Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.3830684Z self=, 2025-05-07T20:33:27.3831106Z T=128, 2025-05-07T20:33:27.3831293Z D=5120, 2025-05-07T20:33:27.3831476Z scale_ub=1200.0, 2025-05-07T20:33:27.3831702Z contiguous=True, 2025-05-07T20:33:27.3831930Z compiled=False, 2025-05-07T20:33:27.3832129Z ) 2025-05-07T20:33:27.6157191Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.6157557Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.6157815Z E ^ 2025-05-07T20:33:27.6158307Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.6158801Z 2025-05-07T20:33:27.6159296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:27.6159890Z
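Hypothesis keeps drawing fresh examples, but the failure is architecture-dependent rather than input-dependent, so every draw fails identically. One way to keep such runners green is to skip the FP8 cases at collection time; a sketch under the assumption that the suite is a plain unittest.TestCase (the class name here is illustrative, and FBGEMM may already gate this differently):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Same capability guard as sketched above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the test class, this skips every FP8 case on pre-8.9 GPUs.
    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class ActivationTests(unittest.TestCase):  # hypothetical class name
        ...  # test_silu_mul_quant and friends as in moe/activation_test.py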
2025-05-07T20:33:27.6160003Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.6160422Z self=, 2025-05-07T20:33:27.6160838Z T=1, 2025-05-07T20:33:27.6161021Z D=7168, 2025-05-07T20:33:27.6161208Z scale_ub=1200.0, 2025-05-07T20:33:27.6161476Z contiguous=True, 2025-05-07T20:33:27.6161706Z compiled=True, 2025-05-07T20:33:27.6161910Z ) 2025-05-07T20:33:27.6198084Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.6198448Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.6198707Z E ^ 2025-05-07T20:33:27.6199203Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.6199695Z 2025-05-07T20:33:27.6200150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:27.6200703Z 2025-05-07T20:33:27.6200876Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.6201304Z self=, 2025-05-07T20:33:27.6201732Z T=1, 2025-05-07T20:33:27.6201915Z D=7168, 2025-05-07T20:33:27.6202099Z scale_ub=1200.0, 2025-05-07T20:33:27.6202324Z contiguous=False, 2025-05-07T20:33:27.6202555Z compiled=True, 2025-05-07T20:33:27.6202756Z ) 2025-05-07T20:33:27.9586047Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.9586404Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.9586666Z E ^ 2025-05-07T20:33:27.9587160Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.9587649Z 2025-05-07T20:33:27.9588098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:27.9588666Z 2025-05-07T20:33:27.9588771Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.9589206Z self=, 2025-05-07T20:33:27.9589629Z T=1, 2025-05-07T20:33:27.9589968Z D=7168, 2025-05-07T20:33:27.9590164Z scale_ub=None, 2025-05-07T20:33:27.9590378Z contiguous=False, 2025-05-07T20:33:27.9590596Z compiled=True, 2025-05-07T20:33:27.9590815Z ) 2025-05-07T20:33:28.0723435Z self = 2025-05-07T20:33:28.0724098Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.0724540Z 2025-05-07T20:33:28.0724656Z @given( 2025-05-07T20:33:28.0724965Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.0725422Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.0725740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.0726088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.0726425Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.0726724Z ) 2025-05-07T20:33:28.0727077Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.0727543Z def test_silu_mul_quant( 2025-05-07T20:33:28.0727791Z self, 2025-05-07T20:33:28.0727978Z T: int, 2025-05-07T20:33:28.0728174Z D: int, 2025-05-07T20:33:28.0728394Z scale_ub: Optional[float], 2025-05-07T20:33:28.0728888Z contiguous: bool, 2025-05-07T20:33:28.0729139Z compiled: bool, 2025-05-07T20:33:28.0729374Z ) -> None: 2025-05-07T20:33:28.0729593Z torch.manual_seed(2025) 2025-05-07T20:33:28.0729830Z 2025-05-07T20:33:28.0730107Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.0730467Z 2025-05-07T20:33:28.0730656Z x_sign = torch.sign(x) 2025-05-07T20:33:28.0730955Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.0731275Z x = x_sign * x_clamp 2025-05-07T20:33:28.0731510Z x0 = x[:, :D] 2025-05-07T20:33:28.0731730Z x1 = x[:, D:] 2025-05-07T20:33:28.0731940Z 2025-05-07T20:33:28.0732117Z if contiguous: 2025-05-07T20:33:28.0732356Z x0 = x0.contiguous() 2025-05-07T20:33:28.0732627Z x1 = x1.contiguous() 2025-05-07T20:33:28.0732867Z 2025-05-07T20:33:28.0733063Z if scale_ub is not None: 2025-05-07T20:33:28.0733350Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.0733687Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.0734010Z ) 2025-05-07T20:33:28.0734202Z else: 2025-05-07T20:33:28.0734406Z scale_ub_tensor = None 2025-05-07T20:33:28.0734679Z 2025-05-07T20:33:28.0735020Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.0735351Z op = silu_mul_quant 2025-05-07T20:33:28.0735712Z if compiled: 2025-05-07T20:33:28.0735961Z op = torch.compile(op) 2025-05-07T20:33:28.0736267Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.0736544Z 2025-05-07T20:33:28.0736737Z y_fp8, y_scale = fn() 2025-05-07T20:33:28.0737027Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:28.0737397Z 2025-05-07T20:33:28.0737628Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.0737975Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:28.0738282Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:28.0738602Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:28.0738978Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.0739304Z 2025-05-07T20:33:28.0739500Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:28.0739712Z 2025-05-07T20:33:28.0739812Z moe/activation_test.py:126: 2025-05-07T20:33:28.0740121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.0740471Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:28.0740802Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.0741648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:28.0742464Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.0743038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.0743768Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.0744504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:28.0745277Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.0746084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:28.0746890Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.0747673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:28.0748360Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.0749047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:28.0749604Z fn() 2025-05-07T20:33:28.0750281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:28.0750905Z self.fn.run( 2025-05-07T20:33:28.0751407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.0751980Z kernel = self.compile( 2025-05-07T20:33:28.0752556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.0753258Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.0753681Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.0753930Z 2025-05-07T20:33:28.0754156Z self = 2025-05-07T20:33:28.0755333Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.0756910Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58fe034160>} 2025-05-07T20:33:28.0758466Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.0759574Z context = 2025-05-07T20:33:28.0759924Z 2025-05-07T20:33:28.0760101Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.0760646Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.0761142Z module_map=module_map) 2025-05-07T20:33:28.0761518Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.0761888Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.0762155Z E ^ 2025-05-07T20:33:28.0762656Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.0763146Z 2025-05-07T20:33:28.0763600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.0764154Z 2025-05-07T20:33:28.0764258Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.0764687Z self=, 2025-05-07T20:33:28.0765115Z T=1, 2025-05-07T20:33:28.0765302Z D=5120, 2025-05-07T20:33:28.0765490Z scale_ub=1200.0, 2025-05-07T20:33:28.0765718Z contiguous=False, 2025-05-07T20:33:28.0765946Z compiled=True, 2025-05-07T20:33:28.0766147Z ) 2025-05-07T20:33:28.2791036Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.2791408Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.2791682Z E ^ 2025-05-07T20:33:28.2792179Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.2792677Z 2025-05-07T20:33:28.2793130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
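Two different kernels hit the same wall in this log: the op under test (_fbgemm_silu_mul_quant) and, in the T=1/D=7168 example above, the reference path's _kernel_quantize_fp8_row reached through triton_quantize_fp8_row and the autotuner. The comparison therefore cannot be rescued by the reference side alone; both sides need fp8e4nv. If a reference were still wanted on such GPUs, a pure-PyTorch stand-in is possible; the scaling scheme below (row max mapped to the E4M3 finite max of 448) is an assumption for illustration, not FBGEMM's exact kernel semantics, and it needs a torch build with float8 storage dtypes:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_torch(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale chosen so the row max lands at E4M3's finite max.
        row_max = x.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / 448.0
        y = (x.to(torch.float32) / scale).clamp(-448.0, 448.0)
        # float8_e4m3fn is used purely as a storage dtype here; no Triton
        # compilation (and hence no SM 8.9 requirement) is involved.
        return y.to(torch.float8_e4m3fn), scale.squeeze(-1)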
2025-05-07T20:33:28.2793696Z 2025-05-07T20:33:28.2793801Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.2794235Z self=, 2025-05-07T20:33:28.2794657Z T=1, 2025-05-07T20:33:28.2794914Z D=5120, 2025-05-07T20:33:28.2795111Z scale_ub=1200.0, 2025-05-07T20:33:28.2795390Z contiguous=False, 2025-05-07T20:33:28.2795617Z compiled=False, 2025-05-07T20:33:28.2795828Z ) 2025-05-07T20:33:28.2830645Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.2831016Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.2831286Z E ^ 2025-05-07T20:33:28.2831777Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.2832273Z 2025-05-07T20:33:28.2832725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.2833296Z
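Every combination tried in this stretch of the log (across T, D, scale_ub, contiguous, and compiled) fails with the same CompilationError, which points at the environment rather than the inputs. For local debugging there is no need to replay the whole verbose search; a single case can be pinned with Hypothesis's example decorator (a sketch mirroring the decorator stack shown above; max_examples=1 stands in for the suite's _MAX_SAMPLES):

    from hypothesis import Verbosity, example, given, settings
    import hypothesis.strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    # Pins the first failing case from this log so it replays deterministically.
    @example(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
    @settings(verbosity=Verbosity.verbose, max_examples=1, deadline=None)
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # body as in moe/activation_test.py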
2025-05-07T20:33:28.4023896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.4024640Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.4025202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.4025930Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.4026642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.4027212Z kernel = self.compile( 2025-05-07T20:33:28.4027774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.4028527Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.4028943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.4029181Z 2025-05-07T20:33:28.4029401Z self = 2025-05-07T20:33:28.4030704Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.4032213Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fe22a1f0>} 2025-05-07T20:33:28.4033687Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.4034800Z context = 2025-05-07T20:33:28.4035106Z 2025-05-07T20:33:28.4035277Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.4035882Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.4036372Z module_map=module_map) 2025-05-07T20:33:28.4036787Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.4037143Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.4037408Z E ^ 2025-05-07T20:33:28.4037902Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:28.4038431Z
2025-05-07T20:33:28.4038881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:28.4039444Z
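[Note: the root cause is the same for every example in this run. Triton's NVIDIA backend only accepts the fp8e4nv dtype (FP8 E4M3) on GPUs with compute capability 8.9 or newer (Ada/Hopper); on older architectures it offers only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal sketch of a capability guard follows; the helper name supports_fp8e4nv and the skip wiring are illustrative assumptions, not code from the FBGEMM test suite:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv maps to CUDA FP8 E4M3, which Triton compiles
    # only for compute capability (8, 9) and newer.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
class Fp8GuardedTests(unittest.TestCase):
    def test_capability_guard(self) -> None:
        # Only runs on hardware where the Triton FP8 kernel can compile.
        self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))

With a guard of this shape the whole test class is skipped once, up front, instead of emitting one identical CompilationError per Hypothesis example.]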
[Hypothesis went on to try ten more examples; every one failed at the same kernel launch with the identical CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The repeated test body and traceback are elided; only the parameter combinations tried are kept:
  test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  test_silu_mul_quant(T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
  test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
  test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
  test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
  test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
  test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)]
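[Note: for readers without the FBGEMM sources at hand, the operator under test fuses a SiLU-gated multiply with FP8 quantization: from the call site, silu_mul_quant(x0, x1, scale_ub_tensor) returns a (y_fp8, y_scale) pair. A rough eager-mode reference of the intended semantics is sketched below, assuming row-wise dynamic scaling and that scale_ub caps the per-row maximum before the scale is derived; the real kernel's scale layout and clamping details may differ:

from typing import Optional, Tuple

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Assumed semantics; not FBGEMM's implementation.
    # y = silu(x0) * x1, computed in fp32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Dynamic per-row scale; the clamp avoids dividing by zero on all-zero rows.
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.float())
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    y_scale = row_max / fp8_max
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(1)

Dequantizing with y_fp8.float() * y_scale.unsqueeze(1) should then approximate y, which is the kind of round-trip check a test like this would assert once compilation succeeds.]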
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:29.6714272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
[The next 11 Hypothesis examples fail with the identical CompilationError and traceback; their duplicated source listings and tracebacks are collapsed to the drawn parameters below.]
2025-05-07T20:33:29.6715346Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:29.6766746Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:29.9694766Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:30.1807197Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:30.1845360Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:30.4116594Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:30.4151702Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:30.5401585Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:30.9424430Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:30.9458551Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:31.0820010Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

The following examples fail with the identical CompilationError (ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")); the repeated test source and traceback are not shown again:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
The next examples fail earlier, while the test builds its inputs, each with torch.OutOfMemoryError (the repeated test source is elided; the allocator statistics differ per example and are kept):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
moe/activation_test.py:95 (x_clamp): torch.OutOfMemoryError: Tried to allocate 112.00 MiB (28.44 MiB free; process 22.03 GiB; 21.61 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)): torch.OutOfMemoryError: Tried to allocate 448.00 MiB (140.44 MiB free; process 21.92 GiB; 21.50 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
moe/activation_test.py:95 (x_clamp): torch.OutOfMemoryError: Tried to allocate 56.00 MiB (28.44 MiB free; process 22.03 GiB; 21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:94 (x_sign = torch.sign(x)): torch.OutOfMemoryError: Tried to allocate 56.00 MiB (28.44 MiB free; process 22.03 GiB; 21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated)
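These out-of-memory failures are cumulative rather than per-example: each example materializes a [T, 2 * D] bfloat16 input plus several temporaries of the same size (at T=16384, D=7168 the input alone is 16384 x 14336 x 2 bytes = 448 MiB), and by this point the process already holds roughly 21.9 of the card's 22.07 GiB, so even a 56 MiB request fails. Below is a minimal sketch of per-example cleanup, assuming tensors from the previous example are no longer referenced; _reset_cuda_pool is a hypothetical helper, not part of the test suite:

import gc

import torch


def _reset_cuda_pool() -> None:
    # Drop leftover references from the previous Hypothesis example, then
    # return the allocator's cached-but-unused blocks to the driver so the
    # next example starts from a near-empty pool.
    gc.collect()
    torch.cuda.empty_cache()

The error text's own suggestion is the complementary fix: launching the suite with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True makes the allocator less prone to fragmentation when examples of many different sizes share one process.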
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.7788556Z 2025-05-07T20:33:31.7788674Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:31.7788895Z 2025-05-07T20:33:31.7789004Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.7789427Z self=, 2025-05-07T20:33:31.7789930Z T=1, 2025-05-07T20:33:31.7790125Z D=7168, 2025-05-07T20:33:31.7790325Z scale_ub=1200.0, 2025-05-07T20:33:31.7790548Z contiguous=True, 2025-05-07T20:33:31.7790779Z compiled=False, 2025-05-07T20:33:31.7790991Z ) 2025-05-07T20:33:32.1129772Z self = 2025-05-07T20:33:32.1130398Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:32.1130689Z 2025-05-07T20:33:32.1130774Z @given( 2025-05-07T20:33:32.1131312Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.1131634Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.1131947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.1132291Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.1132630Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.1132937Z ) 2025-05-07T20:33:32.1133306Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.1133765Z def test_silu_mul_quant( 2025-05-07T20:33:32.1134111Z self, 2025-05-07T20:33:32.1134304Z T: int, 2025-05-07T20:33:32.1134500Z D: int, 2025-05-07T20:33:32.1134721Z scale_ub: Optional[float], 2025-05-07T20:33:32.1134997Z contiguous: bool, 2025-05-07T20:33:32.1135234Z compiled: bool, 2025-05-07T20:33:32.1135470Z ) -> None: 2025-05-07T20:33:32.1135685Z torch.manual_seed(2025) 2025-05-07T20:33:32.1135927Z 2025-05-07T20:33:32.1136209Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.1136565Z 2025-05-07T20:33:32.1136757Z x_sign = torch.sign(x) 2025-05-07T20:33:32.1137048Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.1137366Z x = x_sign * x_clamp 2025-05-07T20:33:32.1137713Z x0 = x[:, :D] 2025-05-07T20:33:32.1137928Z x1 = x[:, D:] 2025-05-07T20:33:32.1138141Z 2025-05-07T20:33:32.1138329Z if contiguous: 2025-05-07T20:33:32.1138557Z x0 = x0.contiguous() 2025-05-07T20:33:32.1138823Z x1 = x1.contiguous() 2025-05-07T20:33:32.1139073Z 2025-05-07T20:33:32.1139262Z if scale_ub is not None: 2025-05-07T20:33:32.1139541Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.1139971Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.1140286Z ) 2025-05-07T20:33:32.1140482Z else: 2025-05-07T20:33:32.1140695Z scale_ub_tensor = None 2025-05-07T20:33:32.1140944Z 2025-05-07T20:33:32.1141177Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.1141504Z op = silu_mul_quant 2025-05-07T20:33:32.1141761Z if compiled: 2025-05-07T20:33:32.1142005Z op = torch.compile(op) 2025-05-07T20:33:32.1142311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.1142594Z 2025-05-07T20:33:32.1142781Z > y_fp8, y_scale = fn() 2025-05-07T20:33:32.1142959Z 2025-05-07T20:33:32.1143057Z moe/activation_test.py:117: 2025-05-07T20:33:32.1143363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.1143707Z moe/activation_test.py:115: in fn 2025-05-07T20:33:32.1143994Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.1144741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:32.1145491Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:32.1146054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.1146790Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.1147505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.1148071Z kernel = self.compile( 2025-05-07T20:33:32.1148644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.1149346Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.1149938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.1150184Z 2025-05-07T20:33:32.1150395Z self = 2025-05-07T20:33:32.1151614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.1153140Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd197040>} 2025-05-07T20:33:32.1154603Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.1155760Z context = 2025-05-07T20:33:32.1156064Z 2025-05-07T20:33:32.1156237Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.1156785Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.1157283Z module_map=module_map) 2025-05-07T20:33:32.1157653Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.1158016Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:32.1158283Z E ^ 2025-05-07T20:33:32.1158820Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.1159311Z 2025-05-07T20:33:32.1159758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.1160322Z 2025-05-07T20:33:32.1160422Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.1160848Z self=, 2025-05-07T20:33:32.1161305Z T=128, 2025-05-07T20:33:32.1161495Z D=5120, 2025-05-07T20:33:32.1161687Z scale_ub=None, 2025-05-07T20:33:32.1161896Z contiguous=True, 2025-05-07T20:33:32.1162114Z compiled=False, 2025-05-07T20:33:32.1162322Z ) 2025-05-07T20:33:32.1162647Z self = 2025-05-07T20:33:32.1163152Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:32.1163437Z 2025-05-07T20:33:32.1163511Z @given( 2025-05-07T20:33:32.1163744Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.1164063Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.1164385Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.1164729Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.1165059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.1165358Z ) 2025-05-07T20:33:32.1165720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.1166188Z def test_silu_mul_quant( 2025-05-07T20:33:32.1166425Z self, 2025-05-07T20:33:32.1166617Z T: int, 2025-05-07T20:33:32.1166813Z D: int, 2025-05-07T20:33:32.1167029Z scale_ub: Optional[float], 2025-05-07T20:33:32.1167303Z contiguous: bool, 2025-05-07T20:33:32.1167545Z compiled: bool, 2025-05-07T20:33:32.1167763Z ) -> None: 2025-05-07T20:33:32.1167980Z torch.manual_seed(2025) 2025-05-07T20:33:32.1168225Z 2025-05-07T20:33:32.1168493Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.1168851Z 2025-05-07T20:33:32.1169044Z x_sign = torch.sign(x) 2025-05-07T20:33:32.1169333Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.1169658Z x = x_sign * x_clamp 2025-05-07T20:33:32.1169901Z x0 = x[:, :D] 2025-05-07T20:33:32.1170113Z x1 = x[:, D:] 2025-05-07T20:33:32.1170324Z 2025-05-07T20:33:32.1170512Z if contiguous: 2025-05-07T20:33:32.1170739Z x0 = x0.contiguous() 2025-05-07T20:33:32.1170998Z x1 = x1.contiguous() 2025-05-07T20:33:32.1171238Z 2025-05-07T20:33:32.1171479Z if scale_ub is not None: 2025-05-07T20:33:32.1171752Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.1172093Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.1172410Z ) 2025-05-07T20:33:32.1172594Z else: 2025-05-07T20:33:32.1172798Z scale_ub_tensor = None 2025-05-07T20:33:32.1173055Z 2025-05-07T20:33:32.1173278Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.1173602Z op = silu_mul_quant 2025-05-07T20:33:32.1173911Z if compiled: 2025-05-07T20:33:32.1174153Z op = torch.compile(op) 2025-05-07T20:33:32.1174453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.1174734Z 2025-05-07T20:33:32.1174920Z > y_fp8, y_scale = fn() 2025-05-07T20:33:32.1175099Z 2025-05-07T20:33:32.1175197Z moe/activation_test.py:117: 2025-05-07T20:33:32.1175499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.1175846Z moe/activation_test.py:115: in fn 2025-05-07T20:33:32.1176123Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.1176861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:32.1177651Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:32.1178214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.1178944Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.1179660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.1180225Z kernel = self.compile( 2025-05-07T20:33:32.1180830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.1181535Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.1181952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.1182193Z 2025-05-07T20:33:32.1182413Z self = 2025-05-07T20:33:32.1183929Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.1185448Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd197a60>} 2025-05-07T20:33:32.1186917Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.1188035Z context = 2025-05-07T20:33:32.1188346Z 2025-05-07T20:33:32.1188520Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.1189075Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.1189571Z module_map=module_map) 2025-05-07T20:33:32.1190045Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.1190404Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:32.1190677Z E ^ 2025-05-07T20:33:32.1191174Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.1191661Z 2025-05-07T20:33:32.1192110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.1192676Z 2025-05-07T20:33:32.1192776Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.1193298Z self=, 2025-05-07T20:33:32.1193735Z T=128, 2025-05-07T20:33:32.1200909Z D=7168, 2025-05-07T20:33:32.1201127Z scale_ub=None, 2025-05-07T20:33:32.1201348Z contiguous=True, 2025-05-07T20:33:32.1201589Z compiled=False, 2025-05-07T20:33:32.1201844Z ) 2025-05-07T20:33:32.2100627Z self = 2025-05-07T20:33:32.2101346Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:32.2101934Z 2025-05-07T20:33:32.2102017Z @given( 2025-05-07T20:33:32.2102267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.2102601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.2102926Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.2103279Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.2103627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.2103927Z ) 2025-05-07T20:33:32.2104296Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.2104772Z def test_silu_mul_quant( 2025-05-07T20:33:32.2105024Z self, 2025-05-07T20:33:32.2105214Z T: int, 2025-05-07T20:33:32.2105417Z D: int, 2025-05-07T20:33:32.2105736Z scale_ub: Optional[float], 2025-05-07T20:33:32.2106013Z contiguous: bool, 2025-05-07T20:33:32.2106264Z compiled: bool, 2025-05-07T20:33:32.2106501Z ) -> None: 2025-05-07T20:33:32.2106719Z torch.manual_seed(2025) 2025-05-07T20:33:32.2106972Z 2025-05-07T20:33:32.2107252Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.2107608Z 2025-05-07T20:33:32.2107890Z x_sign = torch.sign(x) 2025-05-07T20:33:32.2108189Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.2108503Z x = x_sign * x_clamp 2025-05-07T20:33:32.2108750Z x0 = x[:, :D] 2025-05-07T20:33:32.2108969Z x1 = x[:, D:] 2025-05-07T20:33:32.2109173Z 2025-05-07T20:33:32.2109359Z if contiguous: 2025-05-07T20:33:32.2109591Z x0 = x0.contiguous() 2025-05-07T20:33:32.2109980Z x1 = x1.contiguous() 2025-05-07T20:33:32.2110225Z 2025-05-07T20:33:32.2110418Z if scale_ub is not None: 2025-05-07T20:33:32.2110696Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.2111034Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.2111359Z ) 2025-05-07T20:33:32.2111553Z else: 2025-05-07T20:33:32.2111758Z scale_ub_tensor = None 2025-05-07T20:33:32.2112016Z 2025-05-07T20:33:32.2112246Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.2112567Z op = silu_mul_quant 2025-05-07T20:33:32.2112817Z if compiled: 2025-05-07T20:33:32.2113066Z op = torch.compile(op) 2025-05-07T20:33:32.2113366Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.2113649Z 2025-05-07T20:33:32.2113844Z > y_fp8, y_scale = fn() 2025-05-07T20:33:32.2114010Z 2025-05-07T20:33:32.2114108Z moe/activation_test.py:117: 2025-05-07T20:33:32.2114411Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.2114759Z moe/activation_test.py:115: in fn 2025-05-07T20:33:32.2115046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.2115782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:32.2116532Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:32.2117105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.2117836Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.2118642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.2119213Z kernel = self.compile( 2025-05-07T20:33:32.2119792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.2120489Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.2120906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.2121147Z 2025-05-07T20:33:32.2121367Z self = 2025-05-07T20:33:32.2122591Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.2124107Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd153790>} 2025-05-07T20:33:32.2125571Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.2126721Z context = 2025-05-07T20:33:32.2127028Z 2025-05-07T20:33:32.2127204Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.2127747Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.2128241Z module_map=module_map) 2025-05-07T20:33:32.2128622Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.2129026Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:32.2129285Z E ^ 2025-05-07T20:33:32.2129782Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.2130272Z 2025-05-07T20:33:32.2130726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.2131283Z 2025-05-07T20:33:32.2131393Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.2131815Z self=, 2025-05-07T20:33:32.2132235Z T=2048, 2025-05-07T20:33:32.2132420Z D=7168, 2025-05-07T20:33:32.2132608Z scale_ub=1200.0, 2025-05-07T20:33:32.2132835Z contiguous=True, 2025-05-07T20:33:32.2133058Z compiled=False, 2025-05-07T20:33:32.2133262Z ) 2025-05-07T20:33:32.2133584Z self = 2025-05-07T20:33:32.2134109Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:32.2134399Z 2025-05-07T20:33:32.2134475Z @given( 2025-05-07T20:33:32.2134708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.2135029Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.2135345Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.2135676Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.2136014Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.2136308Z ) 2025-05-07T20:33:32.2136667Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.2137134Z def test_silu_mul_quant( 2025-05-07T20:33:32.2137379Z self, 2025-05-07T20:33:32.2137566Z T: int, 2025-05-07T20:33:32.2137765Z D: int, 2025-05-07T20:33:32.2137986Z scale_ub: Optional[float], 2025-05-07T20:33:32.2138255Z contiguous: bool, 2025-05-07T20:33:32.2138503Z compiled: bool, 2025-05-07T20:33:32.2138751Z ) -> None: 2025-05-07T20:33:32.2138986Z torch.manual_seed(2025) 2025-05-07T20:33:32.2139228Z 2025-05-07T20:33:32.2139592Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.2141857Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
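The CompilationError above is the first of many identical ones in this run: Triton's fp8e4nv (FP8 E4M3) dtype is not available on this runner's GPU. A g5.4xlarge carries an NVIDIA A10G (compute capability 8.6), and Triton's fp8e4nv lowering generally requires compute capability 8.9 or newer, which is why only 'fp8e4b15' and 'fp8e5' are offered. A minimal sketch of a capability guard that would skip these cases up front, assuming plain unittest and a hypothetical helper name (this is not FBGEMM's actual test code):

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv needs sm_89+ (Ada/Hopper); the A10G on this runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv requires compute capability >= 8.9")
class ActivationFp8Tests(unittest.TestCase):
    ...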
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.2143969Z 2025-05-07T20:33:32.2144087Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.2144310Z 2025-05-07T20:33:32.2144425Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.2144849Z self=, 2025-05-07T20:33:32.2145273Z T=1, 2025-05-07T20:33:32.2145462Z D=5120, 2025-05-07T20:33:32.2145646Z scale_ub=1200.0, 2025-05-07T20:33:32.2145870Z contiguous=True, 2025-05-07T20:33:32.2146093Z compiled=False, 2025-05-07T20:33:32.2146292Z ) 2025-05-07T20:33:32.2639343Z self = 2025-05-07T20:33:32.2640063Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:32.2640353Z 2025-05-07T20:33:32.2640431Z @given( 2025-05-07T20:33:32.2640662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.2640987Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.2641305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.2641645Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.2642060Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.2642354Z ) 2025-05-07T20:33:32.2642719Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.2643179Z def test_silu_mul_quant( 2025-05-07T20:33:32.2643422Z self, 2025-05-07T20:33:32.2643618Z T: int, 2025-05-07T20:33:32.2643812Z D: int, 2025-05-07T20:33:32.2644034Z scale_ub: Optional[float], 2025-05-07T20:33:32.2644315Z contiguous: bool, 2025-05-07T20:33:32.2644563Z compiled: bool, 2025-05-07T20:33:32.2644785Z ) -> None: 2025-05-07T20:33:32.2645002Z torch.manual_seed(2025) 2025-05-07T20:33:32.2645253Z 2025-05-07T20:33:32.2645523Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.2645885Z 2025-05-07T20:33:32.2646079Z x_sign = torch.sign(x) 2025-05-07T20:33:32.2646369Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.2646698Z x = x_sign * x_clamp 2025-05-07T20:33:32.2646945Z x0 = x[:, :D] 2025-05-07T20:33:32.2647163Z x1 = x[:, D:] 2025-05-07T20:33:32.2647374Z 2025-05-07T20:33:32.2647564Z if contiguous: 2025-05-07T20:33:32.2647798Z x0 = x0.contiguous() 2025-05-07T20:33:32.2648066Z x1 = x1.contiguous() 2025-05-07T20:33:32.2648317Z 2025-05-07T20:33:32.2648507Z if scale_ub is not None: 2025-05-07T20:33:32.2648792Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.2649144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.2649468Z ) 2025-05-07T20:33:32.2649659Z else: 2025-05-07T20:33:32.2649874Z scale_ub_tensor = None 2025-05-07T20:33:32.2650136Z 2025-05-07T20:33:32.2650366Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.2650696Z op = silu_mul_quant 2025-05-07T20:33:32.2650954Z if compiled: 2025-05-07T20:33:32.2651203Z op = torch.compile(op) 2025-05-07T20:33:32.2651511Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.2651804Z 2025-05-07T20:33:32.2651995Z > y_fp8, y_scale = fn() 2025-05-07T20:33:32.2652254Z 2025-05-07T20:33:32.2652355Z moe/activation_test.py:117: 2025-05-07T20:33:32.2652664Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.2653003Z moe/activation_test.py:115: in fn 2025-05-07T20:33:32.2653292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.2654039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:32.2654801Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:32.2655463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.2656200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.2656912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.2657486Z kernel = self.compile( 2025-05-07T20:33:32.2658065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.2658771Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.2659180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.2659480Z 2025-05-07T20:33:32.2659696Z self = 2025-05-07T20:33:32.2660869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.2662393Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd218040>} 2025-05-07T20:33:32.2663905Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.2665007Z context = 2025-05-07T20:33:32.2665322Z 2025-05-07T20:33:32.2665494Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.2666042Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.2666532Z module_map=module_map) 2025-05-07T20:33:32.2666915Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.2667282Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:32.2667548Z E ^ 2025-05-07T20:33:32.2668035Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.2668528Z 2025-05-07T20:33:32.2668979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.2669534Z 2025-05-07T20:33:32.2669642Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.2670158Z self=, 2025-05-07T20:33:32.2670577Z T=2048, 2025-05-07T20:33:32.2670768Z D=5120, 2025-05-07T20:33:32.2670963Z scale_ub=None, 2025-05-07T20:33:32.2671170Z contiguous=True, 2025-05-07T20:33:32.2671394Z compiled=False, 2025-05-07T20:33:32.2671607Z ) 2025-05-07T20:33:32.2671929Z self = 2025-05-07T20:33:32.2672455Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:32.2672741Z 2025-05-07T20:33:32.2672825Z @given( 2025-05-07T20:33:32.2673051Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.2673377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.2673754Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.2674101Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.2674436Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.2674734Z ) 2025-05-07T20:33:32.2675095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.2675555Z def test_silu_mul_quant( 2025-05-07T20:33:32.2675800Z self, 2025-05-07T20:33:32.2676001Z T: int, 2025-05-07T20:33:32.2676194Z D: int, 2025-05-07T20:33:32.2676470Z scale_ub: Optional[float], 2025-05-07T20:33:32.2676747Z contiguous: bool, 2025-05-07T20:33:32.2676985Z compiled: bool, 2025-05-07T20:33:32.2677211Z ) -> None: 2025-05-07T20:33:32.2677431Z torch.manual_seed(2025) 2025-05-07T20:33:32.2677678Z 2025-05-07T20:33:32.2677958Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.2678318Z 2025-05-07T20:33:32.2678514Z > x_sign = torch.sign(x) 2025-05-07T20:33:32.2680771Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.2683072Z 2025-05-07T20:33:32.2683193Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:32.2683421Z 2025-05-07T20:33:32.2683524Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.2684026Z self=, 2025-05-07T20:33:32.2684443Z T=16384, 2025-05-07T20:33:32.2684642Z D=5120, 2025-05-07T20:33:32.2684840Z scale_ub=None, 2025-05-07T20:33:32.2685051Z contiguous=True, 2025-05-07T20:33:32.2685275Z compiled=False, 2025-05-07T20:33:32.2685482Z ) 2025-05-07T20:33:32.2685804Z self = 2025-05-07T20:33:32.2686337Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:32.2686641Z 2025-05-07T20:33:32.2686715Z @given( 2025-05-07T20:33:32.2686953Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.2687279Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.2687595Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.2687935Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.2688270Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.2688568Z ) 2025-05-07T20:33:32.2688929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.2689396Z def test_silu_mul_quant( 2025-05-07T20:33:32.2689635Z self, 2025-05-07T20:33:32.2689830Z T: int, 2025-05-07T20:33:32.2690029Z D: int, 2025-05-07T20:33:32.2690242Z scale_ub: Optional[float], 2025-05-07T20:33:32.2690516Z contiguous: bool, 2025-05-07T20:33:32.2690763Z compiled: bool, 2025-05-07T20:33:32.2690982Z ) -> None: 2025-05-07T20:33:32.2691198Z torch.manual_seed(2025) 2025-05-07T20:33:32.2691449Z 2025-05-07T20:33:32.2691722Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.2694031Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
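The "Tried to allocate" sizes reported above correspond exactly to the single bfloat16 tensor of shape [T, 2 * D] created by torch.randn at the top of the test, at 2 bytes per element. A standalone arithmetic check (a sketch, not part of the test suite):

def alloc_mib(T: int, D: int, bytes_per_elem: int = 2) -> float:
    # Size of one [T, 2*D] bf16 tensor, in MiB, as the allocator reports it.
    return T * 2 * D * bytes_per_elem / 2**20

print(alloc_mib(2048, 7168))   # 56.0  -> "Tried to allocate 56.00 MiB"
print(alloc_mib(2048, 5120))   # 40.0  -> "Tried to allocate 40.00 MiB"
print(alloc_mib(16384, 5120))  # 320.0 -> "Tried to allocate 320.00 MiB"

No single request is large; the failures happen because the process already holds roughly 22 GiB of the card's 22.07 GiB.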
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.2696100Z 2025-05-07T20:33:32.2696218Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.2696444Z 2025-05-07T20:33:32.2696550Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.2696978Z self=, 2025-05-07T20:33:32.2697391Z T=4096, 2025-05-07T20:33:32.2697576Z D=5120, 2025-05-07T20:33:32.2697830Z scale_ub=None, 2025-05-07T20:33:32.2698037Z contiguous=True, 2025-05-07T20:33:32.2698258Z compiled=False, 2025-05-07T20:33:32.2698467Z ) 2025-05-07T20:33:32.3733131Z self = 2025-05-07T20:33:32.3733695Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:32.3733993Z 2025-05-07T20:33:32.3734076Z @given( 2025-05-07T20:33:32.3734314Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.3734634Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.3734950Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.3735292Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.3735784Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.3736085Z ) 2025-05-07T20:33:32.3736453Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.3736921Z def test_silu_mul_quant( 2025-05-07T20:33:32.3737170Z self, 2025-05-07T20:33:32.3737366Z T: int, 2025-05-07T20:33:32.3737565Z D: int, 2025-05-07T20:33:32.3737781Z scale_ub: Optional[float], 2025-05-07T20:33:32.3738146Z contiguous: bool, 2025-05-07T20:33:32.3738399Z compiled: bool, 2025-05-07T20:33:32.3738627Z ) -> None: 2025-05-07T20:33:32.3738847Z torch.manual_seed(2025) 2025-05-07T20:33:32.3739099Z 2025-05-07T20:33:32.3739376Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.3741615Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
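Every example from here on fails the same way, because the memory held by earlier examples is never returned between Hypothesis runs. The allocator message itself suggests expandable segments for the fragmentation component; a sketch of both standard mitigations, assuming the environment variable is set before CUDA is first initialized and that the previous example's tensors really are unreferenced:

import gc
import os

# Must be set before the first CUDA allocation in the process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def reset_cuda_between_examples() -> None:
    gc.collect()                          # drop tensors kept alive only by traceback references
    torch.cuda.empty_cache()              # return cached, unused blocks to the driver
    torch.cuda.reset_peak_memory_stats()  # keep per-example peak readings meaningful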
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.3743677Z 2025-05-07T20:33:32.3743794Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.3744013Z 2025-05-07T20:33:32.3744142Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.3744568Z self=, 2025-05-07T20:33:32.3744991Z T=2048, 2025-05-07T20:33:32.3745168Z D=5120, 2025-05-07T20:33:32.3745356Z scale_ub=None, 2025-05-07T20:33:32.3745567Z contiguous=False, 2025-05-07T20:33:32.3745785Z compiled=False, 2025-05-07T20:33:32.3745988Z ) 2025-05-07T20:33:32.3746313Z self = 2025-05-07T20:33:32.3746830Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:32.3747119Z 2025-05-07T20:33:32.3747194Z @given( 2025-05-07T20:33:32.3747421Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.3747741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.3748046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.3748387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.3748725Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.3749012Z ) 2025-05-07T20:33:32.3749482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.3750037Z def test_silu_mul_quant( 2025-05-07T20:33:32.3750285Z self, 2025-05-07T20:33:32.3750473Z T: int, 2025-05-07T20:33:32.3750670Z D: int, 2025-05-07T20:33:32.3750890Z scale_ub: Optional[float], 2025-05-07T20:33:32.3751159Z contiguous: bool, 2025-05-07T20:33:32.3751405Z compiled: bool, 2025-05-07T20:33:32.3751626Z ) -> None: 2025-05-07T20:33:32.3751837Z torch.manual_seed(2025) 2025-05-07T20:33:32.3752081Z 2025-05-07T20:33:32.3752450Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.3754684Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.3756732Z 2025-05-07T20:33:32.3756848Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.3757118Z 2025-05-07T20:33:32.3757221Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.3757647Z self=, 2025-05-07T20:33:32.3758068Z T=4096, 2025-05-07T20:33:32.3758250Z D=7168, 2025-05-07T20:33:32.3758437Z scale_ub=None, 2025-05-07T20:33:32.3758649Z contiguous=True, 2025-05-07T20:33:32.3758864Z compiled=True, 2025-05-07T20:33:32.3759114Z ) 2025-05-07T20:33:32.3759443Z self = 2025-05-07T20:33:32.3759949Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:32.3760238Z 2025-05-07T20:33:32.3760313Z @given( 2025-05-07T20:33:32.3760539Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.3760863Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.3761170Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.3761508Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.3761846Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.3762135Z ) 2025-05-07T20:33:32.3762492Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.3762956Z def test_silu_mul_quant( 2025-05-07T20:33:32.3763191Z self, 2025-05-07T20:33:32.3763383Z T: int, 2025-05-07T20:33:32.3763579Z D: int, 2025-05-07T20:33:32.3763795Z scale_ub: Optional[float], 2025-05-07T20:33:32.3764069Z contiguous: bool, 2025-05-07T20:33:32.3764308Z compiled: bool, 2025-05-07T20:33:32.3764524Z ) -> None: 2025-05-07T20:33:32.3764740Z torch.manual_seed(2025) 2025-05-07T20:33:32.3764985Z 2025-05-07T20:33:32.3765259Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.3767500Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.3769567Z 2025-05-07T20:33:32.3769682Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.3769903Z 2025-05-07T20:33:32.3770003Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.3770474Z self=, 2025-05-07T20:33:32.3772216Z T=2048, 2025-05-07T20:33:32.3772397Z D=5120, 2025-05-07T20:33:32.3772587Z scale_ub=1200.0, 2025-05-07T20:33:32.3772803Z contiguous=False, 2025-05-07T20:33:32.3773028Z compiled=False, 2025-05-07T20:33:32.3773228Z ) 2025-05-07T20:33:32.3773558Z self = 2025-05-07T20:33:32.3774089Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:32.3774446Z 2025-05-07T20:33:32.3774525Z @given( 2025-05-07T20:33:32.3774761Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.3783639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.3784098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.3784446Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.3784785Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.3785074Z ) 2025-05-07T20:33:32.3785437Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.3785910Z def test_silu_mul_quant( 2025-05-07T20:33:32.3786156Z self, 2025-05-07T20:33:32.3786347Z T: int, 2025-05-07T20:33:32.3786532Z D: int, 2025-05-07T20:33:32.3786875Z scale_ub: Optional[float], 2025-05-07T20:33:32.3787150Z contiguous: bool, 2025-05-07T20:33:32.3787392Z compiled: bool, 2025-05-07T20:33:32.3787622Z ) -> None: 2025-05-07T20:33:32.3787838Z torch.manual_seed(2025) 2025-05-07T20:33:32.3788087Z 2025-05-07T20:33:32.3788372Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.3790766Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.3792904Z 2025-05-07T20:33:32.3793029Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.3793258Z 2025-05-07T20:33:32.3793363Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.3793798Z self=, 2025-05-07T20:33:32.3794224Z T=4096, 2025-05-07T20:33:32.3794410Z D=7168, 2025-05-07T20:33:32.3794605Z scale_ub=1200.0, 2025-05-07T20:33:32.3794832Z contiguous=True, 2025-05-07T20:33:32.3795055Z compiled=False, 2025-05-07T20:33:32.3795265Z ) 2025-05-07T20:33:32.3795593Z self = 2025-05-07T20:33:32.3796111Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:32.3796405Z 2025-05-07T20:33:32.3796481Z @given( 2025-05-07T20:33:32.3796708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.3797023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.3797343Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.3797683Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.3798019Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.3798309Z ) 2025-05-07T20:33:32.3798668Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.3799134Z def test_silu_mul_quant( 2025-05-07T20:33:32.3799373Z self, 2025-05-07T20:33:32.3799573Z T: int, 2025-05-07T20:33:32.3799766Z D: int, 2025-05-07T20:33:32.3799977Z scale_ub: Optional[float], 2025-05-07T20:33:32.3800254Z contiguous: bool, 2025-05-07T20:33:32.3800569Z compiled: bool, 2025-05-07T20:33:32.3800785Z ) -> None: 2025-05-07T20:33:32.3801001Z torch.manual_seed(2025) 2025-05-07T20:33:32.3801242Z 2025-05-07T20:33:32.3801511Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.3803762Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.3806446Z 2025-05-07T20:33:32.3806563Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.3806786Z 2025-05-07T20:33:32.3806892Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.3807317Z self=, 2025-05-07T20:33:32.3807731Z T=16384, 2025-05-07T20:33:32.3807925Z D=7168, 2025-05-07T20:33:32.3808115Z scale_ub=None, 2025-05-07T20:33:32.3808320Z contiguous=False, 2025-05-07T20:33:32.3808593Z compiled=True, 2025-05-07T20:33:32.3808798Z ) 2025-05-07T20:33:32.5106377Z self = 2025-05-07T20:33:32.5106947Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:32.5107272Z 2025-05-07T20:33:32.5107354Z @given( 2025-05-07T20:33:32.5107595Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.5108132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.5108453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.5108801Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.5109151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.5109448Z ) 2025-05-07T20:33:32.5109980Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.5110457Z def test_silu_mul_quant( 2025-05-07T20:33:32.5110703Z self, 2025-05-07T20:33:32.5110903Z T: int, 2025-05-07T20:33:32.5111114Z D: int, 2025-05-07T20:33:32.5111333Z scale_ub: Optional[float], 2025-05-07T20:33:32.5111619Z contiguous: bool, 2025-05-07T20:33:32.5111865Z compiled: bool, 2025-05-07T20:33:32.5112091Z ) -> None: 2025-05-07T20:33:32.5112305Z torch.manual_seed(2025) 2025-05-07T20:33:32.5112553Z 2025-05-07T20:33:32.5112821Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.5115094Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.5117181Z 2025-05-07T20:33:32.5117297Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.5117524Z 2025-05-07T20:33:32.5117627Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.5118054Z self=, 2025-05-07T20:33:32.5118467Z T=4096, 2025-05-07T20:33:32.5118654Z D=7168, 2025-05-07T20:33:32.5118848Z scale_ub=None, 2025-05-07T20:33:32.5119055Z contiguous=True, 2025-05-07T20:33:32.5119277Z compiled=False, 2025-05-07T20:33:32.5119485Z ) 2025-05-07T20:33:32.5119892Z self = 2025-05-07T20:33:32.5120413Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:32.5120706Z 2025-05-07T20:33:32.5120783Z @given( 2025-05-07T20:33:32.5121010Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.5121329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.5121642Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.5121976Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.5122414Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.5122729Z ) 2025-05-07T20:33:32.5123091Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.5123555Z def test_silu_mul_quant( 2025-05-07T20:33:32.5123798Z self, 2025-05-07T20:33:32.5123988Z T: int, 2025-05-07T20:33:32.5124184Z D: int, 2025-05-07T20:33:32.5124396Z scale_ub: Optional[float], 2025-05-07T20:33:32.5124674Z contiguous: bool, 2025-05-07T20:33:32.5124919Z compiled: bool, 2025-05-07T20:33:32.5125144Z ) -> None: 2025-05-07T20:33:32.5125356Z torch.manual_seed(2025) 2025-05-07T20:33:32.5125608Z 2025-05-07T20:33:32.5125882Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.5128201Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.5130307Z 2025-05-07T20:33:32.5130432Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.5130659Z 2025-05-07T20:33:32.5130761Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.5131188Z self=, 2025-05-07T20:33:32.5131614Z T=16384, 2025-05-07T20:33:32.5131801Z D=7168, 2025-05-07T20:33:32.5131996Z scale_ub=None, 2025-05-07T20:33:32.5132210Z contiguous=True, 2025-05-07T20:33:32.5132430Z compiled=False, 2025-05-07T20:33:32.5132636Z ) 2025-05-07T20:33:32.5132959Z self = 2025-05-07T20:33:32.5133475Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:32.5133774Z 2025-05-07T20:33:32.5133851Z @given( 2025-05-07T20:33:32.5134081Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.5134396Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.5134708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.5135045Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.5135381Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.5135670Z ) 2025-05-07T20:33:32.5136031Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.5136497Z def test_silu_mul_quant( 2025-05-07T20:33:32.5136743Z self, 2025-05-07T20:33:32.5136940Z T: int, 2025-05-07T20:33:32.5137134Z D: int, 2025-05-07T20:33:32.5137345Z scale_ub: Optional[float], 2025-05-07T20:33:32.5137626Z contiguous: bool, 2025-05-07T20:33:32.5137870Z compiled: bool, 2025-05-07T20:33:32.5138088Z ) -> None: 2025-05-07T20:33:32.5138303Z torch.manual_seed(2025) 2025-05-07T20:33:32.5138548Z 2025-05-07T20:33:32.5138847Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.5141171Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.5143239Z 2025-05-07T20:33:32.5143355Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.5143618Z 2025-05-07T20:33:32.5143722Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.5144146Z self=, 2025-05-07T20:33:32.5144564Z T=16384, 2025-05-07T20:33:32.5144756Z D=7168, 2025-05-07T20:33:32.5144948Z scale_ub=1200.0, 2025-05-07T20:33:32.5145165Z contiguous=True, 2025-05-07T20:33:32.5145388Z compiled=False, 2025-05-07T20:33:32.5145597Z ) 2025-05-07T20:33:32.5145915Z self = 2025-05-07T20:33:32.5146440Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:32.5146741Z 2025-05-07T20:33:32.5146822Z @given( 2025-05-07T20:33:32.5147095Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.5147413Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.5147730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.5148079Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.5148414Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.5148717Z ) 2025-05-07T20:33:32.5149080Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.5149586Z def test_silu_mul_quant( 2025-05-07T20:33:32.5149936Z self, 2025-05-07T20:33:32.5150130Z T: int, 2025-05-07T20:33:32.5150342Z D: int, 2025-05-07T20:33:32.5150560Z scale_ub: Optional[float], 2025-05-07T20:33:32.5150837Z contiguous: bool, 2025-05-07T20:33:32.5151086Z compiled: bool, 2025-05-07T20:33:32.5151306Z ) -> None: 2025-05-07T20:33:32.5151525Z torch.manual_seed(2025) 2025-05-07T20:33:32.5151771Z 2025-05-07T20:33:32.5152043Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.5154288Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.5156355Z 2025-05-07T20:33:32.5156474Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.5156702Z 2025-05-07T20:33:32.5156803Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.5157229Z self=, 2025-05-07T20:33:32.5157644Z T=128, 2025-05-07T20:33:32.5157836Z D=5120, 2025-05-07T20:33:32.5158028Z scale_ub=1200.0, 2025-05-07T20:33:32.5158248Z contiguous=False, 2025-05-07T20:33:32.5158481Z compiled=False, 2025-05-07T20:33:32.5158688Z ) 2025-05-07T20:33:32.6789679Z self = 2025-05-07T20:33:32.6790523Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:32.6790895Z 2025-05-07T20:33:32.6790975Z @given( 2025-05-07T20:33:32.6791206Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.6791529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.6792181Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.6792526Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.6792865Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.6793154Z ) 2025-05-07T20:33:32.6793525Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.6793992Z def test_silu_mul_quant( 2025-05-07T20:33:32.6794232Z self, 2025-05-07T20:33:32.6794425Z T: int, 2025-05-07T20:33:32.6794621Z D: int, 2025-05-07T20:33:32.6794922Z scale_ub: Optional[float], 2025-05-07T20:33:32.6795191Z contiguous: bool, 2025-05-07T20:33:32.6795440Z compiled: bool, 2025-05-07T20:33:32.6795667Z ) -> None: 2025-05-07T20:33:32.6795882Z torch.manual_seed(2025) 2025-05-07T20:33:32.6796130Z 2025-05-07T20:33:32.6796403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.6796752Z 2025-05-07T20:33:32.6796949Z x_sign = torch.sign(x) 2025-05-07T20:33:32.6797249Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.6797569Z x = x_sign * x_clamp 2025-05-07T20:33:32.6797819Z x0 = x[:, :D] 2025-05-07T20:33:32.6798039Z x1 = x[:, D:] 2025-05-07T20:33:32.6798242Z 2025-05-07T20:33:32.6798522Z if contiguous: 2025-05-07T20:33:32.6798765Z x0 = x0.contiguous() 2025-05-07T20:33:32.6799026Z x1 = x1.contiguous() 2025-05-07T20:33:32.6799272Z 2025-05-07T20:33:32.6799470Z if scale_ub is not None: 2025-05-07T20:33:32.6799745Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.6800098Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.6800420Z ) 2025-05-07T20:33:32.6800696Z else: 2025-05-07T20:33:32.6800903Z scale_ub_tensor = None 2025-05-07T20:33:32.6801162Z 2025-05-07T20:33:32.6801396Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.6801721Z op = silu_mul_quant 2025-05-07T20:33:32.6801980Z if compiled: 2025-05-07T20:33:32.6802233Z op = torch.compile(op) 2025-05-07T20:33:32.6802536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.6802822Z 2025-05-07T20:33:32.6803023Z > y_fp8, y_scale = fn() 2025-05-07T20:33:32.6803192Z 2025-05-07T20:33:32.6803288Z moe/activation_test.py:117: 2025-05-07T20:33:32.6803591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.6803940Z moe/activation_test.py:115: in fn 2025-05-07T20:33:32.6804227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.6804966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:32.6805715Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:32.6806289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.6807017Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.6807725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.6808294Z kernel = self.compile( 2025-05-07T20:33:32.6808905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.6809621Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.6810039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.6810280Z 2025-05-07T20:33:32.6810500Z self = 2025-05-07T20:33:32.6811727Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.6813241Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd00bca0>} 2025-05-07T20:33:32.6814715Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.6815868Z context = 2025-05-07T20:33:32.6816173Z 2025-05-07T20:33:32.6816348Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.6816891Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.6817386Z module_map=module_map) 2025-05-07T20:33:32.6817769Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.6818130Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:32.6818392Z E ^ 2025-05-07T20:33:32.6818885Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.6819373Z 2025-05-07T20:33:32.6819869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.6820427Z 2025-05-07T20:33:32.6820529Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.6820961Z self=, 2025-05-07T20:33:32.6821385Z T=2048, 2025-05-07T20:33:32.6821574Z D=7168, 2025-05-07T20:33:32.6821759Z scale_ub=None, 2025-05-07T20:33:32.6822023Z contiguous=False, 2025-05-07T20:33:32.6822257Z compiled=False, 2025-05-07T20:33:32.6822463Z ) 2025-05-07T20:33:32.6822788Z self = 2025-05-07T20:33:32.6823312Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:32.6823601Z 2025-05-07T20:33:32.6823679Z @given( 2025-05-07T20:33:32.6823908Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.6824231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.6824545Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.6824888Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.6825234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.6825534Z ) 2025-05-07T20:33:32.6825890Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.6826355Z def test_silu_mul_quant( 2025-05-07T20:33:32.6826602Z self, 2025-05-07T20:33:32.6826795Z T: int, 2025-05-07T20:33:32.6826996Z D: int, 2025-05-07T20:33:32.6827213Z scale_ub: Optional[float], 2025-05-07T20:33:32.6827480Z contiguous: bool, 2025-05-07T20:33:32.6827724Z compiled: bool, 2025-05-07T20:33:32.6827952Z ) -> None: 2025-05-07T20:33:32.6828160Z torch.manual_seed(2025) 2025-05-07T20:33:32.6828406Z 2025-05-07T20:33:32.6828679Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.6831043Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.6833104Z 2025-05-07T20:33:32.6833227Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.6833445Z 2025-05-07T20:33:32.6833597Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.6834026Z self=, 2025-05-07T20:33:32.6834447Z T=128, 2025-05-07T20:33:32.6834626Z D=7168, 2025-05-07T20:33:32.6834818Z scale_ub=1200.0, 2025-05-07T20:33:32.6835044Z contiguous=True, 2025-05-07T20:33:32.6835263Z compiled=True, 2025-05-07T20:33:32.6835467Z ) 2025-05-07T20:33:32.7286323Z self = 2025-05-07T20:33:32.7287076Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:32.7287369Z 2025-05-07T20:33:32.7287460Z @given( 2025-05-07T20:33:32.7287703Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.7288030Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.7288350Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.7288702Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.7289044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.7289348Z ) 2025-05-07T20:33:32.7289720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.7290188Z def test_silu_mul_quant( 2025-05-07T20:33:32.7290445Z self, 2025-05-07T20:33:32.7290723Z T: int, 2025-05-07T20:33:32.7290919Z D: int, 2025-05-07T20:33:32.7291141Z scale_ub: Optional[float], 2025-05-07T20:33:32.7291420Z contiguous: bool, 2025-05-07T20:33:32.7291661Z compiled: bool, 2025-05-07T20:33:32.7291887Z ) -> None: 2025-05-07T20:33:32.7292105Z torch.manual_seed(2025) 2025-05-07T20:33:32.7292354Z 2025-05-07T20:33:32.7292625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.7293061Z 2025-05-07T20:33:32.7293255Z x_sign = torch.sign(x) 2025-05-07T20:33:32.7293545Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.7293872Z x = x_sign * x_clamp 2025-05-07T20:33:32.7294121Z x0 = x[:, :D] 2025-05-07T20:33:32.7294333Z x1 = x[:, D:] 2025-05-07T20:33:32.7294541Z 2025-05-07T20:33:32.7294727Z if contiguous: 2025-05-07T20:33:32.7294955Z x0 = x0.contiguous() 2025-05-07T20:33:32.7295228Z x1 = x1.contiguous() 2025-05-07T20:33:32.7295473Z 2025-05-07T20:33:32.7295660Z if scale_ub is not None: 2025-05-07T20:33:32.7295939Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.7296283Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.7296603Z ) 2025-05-07T20:33:32.7296798Z else: 2025-05-07T20:33:32.7297010Z scale_ub_tensor = None 2025-05-07T20:33:32.7297267Z 2025-05-07T20:33:32.7297497Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.7297821Z op = silu_mul_quant 2025-05-07T20:33:32.7298077Z if compiled: 2025-05-07T20:33:32.7298330Z op = torch.compile(op) 2025-05-07T20:33:32.7298634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.7298922Z 2025-05-07T20:33:32.7299106Z > y_fp8, y_scale = fn() 2025-05-07T20:33:32.7299280Z 2025-05-07T20:33:32.7299377Z moe/activation_test.py:117: 2025-05-07T20:33:32.7299681Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.7300017Z moe/activation_test.py:115: in fn 2025-05-07T20:33:32.7300304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.7300895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:32.7301494Z return fn(*args, **kwargs) 2025-05-07T20:33:32.7302193Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:32.7302938Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:32.7303590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.7304319Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.7305032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.7305610Z kernel = self.compile( 2025-05-07T20:33:32.7306186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.7306926Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.7307346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.7307587Z 2025-05-07T20:33:32.7307812Z self = 2025-05-07T20:33:32.7309027Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.7310707Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fcf3c0d0>} 2025-05-07T20:33:32.7312234Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.7313348Z context = 2025-05-07T20:33:32.7313650Z 2025-05-07T20:33:32.7313825Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.7314444Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.7314940Z module_map=module_map) 2025-05-07T20:33:32.7323986Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.7324387Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:32.7324661Z E ^ 2025-05-07T20:33:32.7325174Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.7325677Z 2025-05-07T20:33:32.7326137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.7326699Z 2025-05-07T20:33:32.7326820Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.7327250Z self=, 2025-05-07T20:33:32.7327686Z T=128, 2025-05-07T20:33:32.7327886Z D=7168, 2025-05-07T20:33:32.7328084Z scale_ub=1200.0, 2025-05-07T20:33:32.7328316Z contiguous=True, 2025-05-07T20:33:32.7328550Z compiled=False, 2025-05-07T20:33:32.7328765Z ) 2025-05-07T20:33:32.7329095Z self = 2025-05-07T20:33:32.7329627Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:32.7329919Z 2025-05-07T20:33:32.7330009Z @given( 2025-05-07T20:33:32.7330241Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.7330575Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.7330901Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.7331247Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.7331603Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.7331911Z ) 2025-05-07T20:33:32.7332282Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.7332760Z def test_silu_mul_quant( 2025-05-07T20:33:32.7333016Z self, 2025-05-07T20:33:32.7333213Z T: int, 2025-05-07T20:33:32.7333418Z D: int, 2025-05-07T20:33:32.7333639Z scale_ub: Optional[float], 2025-05-07T20:33:32.7334029Z contiguous: bool, 2025-05-07T20:33:32.7334283Z compiled: bool, 2025-05-07T20:33:32.7334507Z ) -> None: 2025-05-07T20:33:32.7334728Z torch.manual_seed(2025) 2025-05-07T20:33:32.7334982Z 2025-05-07T20:33:32.7335258Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.7335620Z 2025-05-07T20:33:32.7335820Z x_sign = torch.sign(x) 2025-05-07T20:33:32.7336112Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.7338380Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.7340494Z 2025-05-07T20:33:32.7340612Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:32.7340840Z 2025-05-07T20:33:32.7340943Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.7341415Z self=, 2025-05-07T20:33:32.7341838Z T=128, 2025-05-07T20:33:32.7342029Z D=5120, 2025-05-07T20:33:32.7342223Z scale_ub=1200.0, 2025-05-07T20:33:32.7342448Z contiguous=True, 2025-05-07T20:33:32.7342675Z compiled=True, 2025-05-07T20:33:32.7342884Z ) 2025-05-07T20:33:32.7343207Z self = 2025-05-07T20:33:32.7343781Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:32.7344071Z 2025-05-07T20:33:32.7344151Z @given( 2025-05-07T20:33:32.7344388Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.7344705Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.7345021Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.7345366Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.7345703Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.7346002Z ) 2025-05-07T20:33:32.7346366Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.7346828Z def test_silu_mul_quant( 2025-05-07T20:33:32.7347081Z self, 2025-05-07T20:33:32.7347276Z T: int, 2025-05-07T20:33:32.7347471Z D: int, 2025-05-07T20:33:32.7347693Z scale_ub: Optional[float], 2025-05-07T20:33:32.7347977Z contiguous: bool, 2025-05-07T20:33:32.7348222Z compiled: bool, 2025-05-07T20:33:32.7348451Z ) -> None: 2025-05-07T20:33:32.7348670Z torch.manual_seed(2025) 2025-05-07T20:33:32.7348922Z 2025-05-07T20:33:32.7349197Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.7349550Z 2025-05-07T20:33:32.7349748Z x_sign = torch.sign(x) 2025-05-07T20:33:32.7350141Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.7352339Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.7354397Z 2025-05-07T20:33:32.7354515Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:32.7354737Z 2025-05-07T20:33:32.7354902Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.7355335Z self=, 2025-05-07T20:33:32.7355757Z T=128, 2025-05-07T20:33:32.7355948Z D=7168, 2025-05-07T20:33:32.7356142Z scale_ub=None, 2025-05-07T20:33:32.7356353Z contiguous=True, 2025-05-07T20:33:32.7356576Z compiled=True, 2025-05-07T20:33:32.7356785Z ) 2025-05-07T20:33:32.9426705Z self = 2025-05-07T20:33:32.9427269Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:32.9427813Z 2025-05-07T20:33:32.9427896Z @given( 2025-05-07T20:33:32.9428140Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.9428461Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.9428787Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.9429130Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.9429478Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.9429912Z ) 2025-05-07T20:33:32.9430279Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.9430748Z def test_silu_mul_quant( 2025-05-07T20:33:32.9430986Z self, 2025-05-07T20:33:32.9431185Z T: int, 2025-05-07T20:33:32.9431474Z D: int, 2025-05-07T20:33:32.9431691Z scale_ub: Optional[float], 2025-05-07T20:33:32.9431973Z contiguous: bool, 2025-05-07T20:33:32.9432217Z compiled: bool, 2025-05-07T20:33:32.9432452Z ) -> None: 2025-05-07T20:33:32.9432672Z torch.manual_seed(2025) 2025-05-07T20:33:32.9432912Z 2025-05-07T20:33:32.9433179Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9435531Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.9437612Z 2025-05-07T20:33:32.9437729Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.9437954Z 2025-05-07T20:33:32.9448178Z FAILED 2025-05-07T20:33:32.9448402Z 2025-05-07T20:33:32.9448640Z =================================== FAILURES =================================== 2025-05-07T20:33:32.9449329Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:32.9449981Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:32.9450882Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:33:32.9451692Z | yield 2025-05-07T20:33:32.9452219Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:33:32.9452779Z | self._callTestMethod(testMethod) 2025-05-07T20:33:32.9453438Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:33:32.9454310Z | method() 2025-05-07T20:33:32.9455253Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:32.9456345Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.9457277Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:32.9458210Z | raise the_error_hypothesis_found 2025-05-07T20:33:32.9459080Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:32.9459802Z +-+---------------- 1 ---------------- 2025-05-07T20:33:32.9460199Z | Traceback (most recent call last): 2025-05-07T20:33:32.9461216Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:32.9462341Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9465346Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.9468317Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:32.9468933Z | self=, 2025-05-07T20:33:32.9469506Z | T=2048, 2025-05-07T20:33:32.9469981Z | D=5120, # or any other generated value 2025-05-07T20:33:32.9470528Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:32.9471050Z | contiguous=True, # or any other generated value 2025-05-07T20:33:32.9471551Z | compiled=False, # or any other generated value 2025-05-07T20:33:32.9471980Z | ) 2025-05-07T20:33:32.9472218Z | 2025-05-07T20:33:32.9472951Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:32.9473884Z +---------------- 2 ---------------- 2025-05-07T20:33:32.9474290Z | Traceback (most recent call last): 2025-05-07T20:33:32.9475308Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:32.9476423Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9479449Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.9482351Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:32.9483209Z | self=, 2025-05-07T20:33:32.9483643Z | T=128, 2025-05-07T20:33:32.9483843Z | D=7168, 2025-05-07T20:33:32.9484055Z | scale_ub=None, 2025-05-07T20:33:32.9484303Z | contiguous=True, 2025-05-07T20:33:32.9484545Z | compiled=True, 2025-05-07T20:33:32.9484769Z | ) 2025-05-07T20:33:32.9485820Z | 2025-05-07T20:33:32.9486438Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:32.9487084Z +---------------- 3 ---------------- 2025-05-07T20:33:32.9487385Z | Traceback (most recent call last): 2025-05-07T20:33:32.9488143Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:32.9488979Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9491318Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
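Each falsifying example above ends with a version-pinned reproduction blob. A sketch of how the first one would be replayed locally (the decorator is temporary, is only valid for this exact Hypothesis version, 6.131.14, and the @settings arguments are trimmed relative to the listing above):

import unittest

from hypothesis import given, reproduce_failure, settings, strategies as st

class ActivationTests(unittest.TestCase):
    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # blob from failure 1 above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...  # body unchanged from the listing above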
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.9493537Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:32.9493992Z | self=, 2025-05-07T20:33:32.9494423Z | T=128, 2025-05-07T20:33:32.9494624Z | D=5120, 2025-05-07T20:33:32.9494841Z | scale_ub=1200.0, 2025-05-07T20:33:32.9495091Z | contiguous=True, 2025-05-07T20:33:32.9495331Z | compiled=True, 2025-05-07T20:33:32.9495647Z | ) 2025-05-07T20:33:32.9495897Z | 2025-05-07T20:33:32.9496664Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:32.9497535Z +---------------- 4 ---------------- 2025-05-07T20:33:32.9497943Z | Traceback (most recent call last): 2025-05-07T20:33:32.9499147Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:32.9500218Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:32.9501202Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:32.9502328Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:32.9503598Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:32.9504798Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:32.9505702Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:32.9506782Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.9507867Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:32.9509026Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:32.9510349Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:33:32.9511561Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:32.9512733Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:32.9513775Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:32.9514746Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:32.9515590Z | fn() 2025-05-07T20:33:32.9516425Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:32.9517380Z | self.fn.run( 2025-05-07T20:33:32.9518158Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:32.9519069Z | kernel = self.compile( 2025-05-07T20:33:32.9520002Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:32.9521029Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.9522090Z | File 
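Three of the four distinct failures above are allocator OOMs raised while building the [T, 2 * D] bf16 input, with PyTorch already holding 21.77 GiB of the 22.07 GiB device. The error text itself points at PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; below is a minimal sketch, in that spirit, of re-running the test with the suggested setting plus an explicit cache flush between examples. The setUp hook and the pytest invocation in the comment are illustrative assumptions, not part of the existing test file.

    # Sketch only: allocator hygiene for re-running this test locally. The env
    # var is the one suggested by the OutOfMemoryError above; it must be set
    # before the first CUDA allocation in the process, e.g.
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    #       python -m pytest moe/activation_test.py -k test_silu_mul_quant
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import unittest

    import torch


    class ActivationTestsWithCleanup(unittest.TestCase):  # hypothetical name
        def setUp(self) -> None:
            # Drop cached blocks left over from earlier Hypothesis examples so
            # each generated (T, D) shape starts from a cleaner allocator state.
            if torch.cuda.is_available():
                torch.cuda.synchronize()
                torch.cuda.empty_cache()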
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:32.9523281Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.9524046Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.9524558Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:32.9524992Z | ^ 2025-05-07T20:33:32.9525666Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.9526523Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:32.9527102Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:32.9527849Z | self=, 2025-05-07T20:33:32.9528489Z | T=1, # or any other generated value 2025-05-07T20:33:32.9528942Z | D=5120, # or any other generated value 2025-05-07T20:33:32.9529423Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:32.9529933Z | contiguous=True, # or any other generated value 2025-05-07T20:33:32.9530516Z | compiled=True, # or any other generated value 2025-05-07T20:33:32.9530949Z | ) 2025-05-07T20:33:32.9531182Z | 2025-05-07T20:33:32.9531925Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:32.9532803Z +------------------------------------ 2025-05-07T20:33:32.9533298Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:32.9533907Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.9534512Z self=, 2025-05-07T20:33:32.9535106Z T=1, 2025-05-07T20:33:32.9535363Z D=5120, 2025-05-07T20:33:32.9535635Z scale_ub=None, 2025-05-07T20:33:32.9535942Z contiguous=True, 2025-05-07T20:33:32.9536251Z compiled=True, 2025-05-07T20:33:32.9536543Z ) 2025-05-07T20:33:32.9536997Z self = 2025-05-07T20:33:32.9537697Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:32.9538072Z 2025-05-07T20:33:32.9538181Z @given( 2025-05-07T20:33:32.9538487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.9538951Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.9539362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.9539820Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.9540265Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.9540646Z ) 2025-05-07T20:33:32.9541131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.9541764Z def test_silu_mul_quant( 2025-05-07T20:33:32.9542099Z self, 2025-05-07T20:33:32.9542374Z T: int, 2025-05-07T20:33:32.9542647Z D: int, 2025-05-07T20:33:32.9542947Z scale_ub: Optional[float], 2025-05-07T20:33:32.9543335Z contiguous: bool, 2025-05-07T20:33:32.9543665Z compiled: bool, 2025-05-07T20:33:32.9543961Z ) -> None: 2025-05-07T20:33:32.9544255Z torch.manual_seed(2025) 2025-05-07T20:33:32.9544607Z 2025-05-07T20:33:32.9544987Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9545490Z 2025-05-07T20:33:32.9545752Z x_sign = torch.sign(x) 2025-05-07T20:33:32.9546159Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.9546591Z x = x_sign * x_clamp 2025-05-07T20:33:32.9546923Z x0 = x[:, :D] 2025-05-07T20:33:32.9547187Z x1 = x[:, D:] 2025-05-07T20:33:32.9547554Z 2025-05-07T20:33:32.9547785Z if contiguous: 2025-05-07T20:33:32.9548100Z x0 = x0.contiguous() 
2025-05-07T20:33:32.9548457Z x1 = x1.contiguous() 2025-05-07T20:33:32.9548803Z 2025-05-07T20:33:32.9549116Z if scale_ub is not None: 2025-05-07T20:33:32.9549506Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.9550142Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.9550591Z ) 2025-05-07T20:33:32.9550871Z else: 2025-05-07T20:33:32.9551225Z scale_ub_tensor = None 2025-05-07T20:33:32.9551593Z 2025-05-07T20:33:32.9551910Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.9552371Z op = silu_mul_quant 2025-05-07T20:33:32.9552734Z if compiled: 2025-05-07T20:33:32.9553089Z op = torch.compile(op) 2025-05-07T20:33:32.9553506Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.9553907Z 2025-05-07T20:33:32.9554175Z y_fp8, y_scale = fn() 2025-05-07T20:33:32.9554578Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:32.9555001Z 2025-05-07T20:33:32.9555331Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.9555776Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:32.9556245Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:32.9556665Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:32.9557148Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:32.9557582Z 2025-05-07T20:33:32.9557855Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:32.9558127Z 2025-05-07T20:33:32.9558266Z moe/activation_test.py:126: 2025-05-07T20:33:32.9558708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.9559155Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:32.9559585Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:32.9560733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:32.9561835Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:32.9562611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.9563593Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.9564565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:32.9565566Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:32.9566654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:32.9567698Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:32.9568740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:32.9569648Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:32.9570527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:32.9571264Z fn() 2025-05-07T20:33:32.9571974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:32.9572802Z self.fn.run( 2025-05-07T20:33:32.9573454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.9574205Z kernel = self.compile( 2025-05-07T20:33:32.9574968Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.9575964Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.9576528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.9576869Z 2025-05-07T20:33:32.9577149Z self = 2025-05-07T20:33:32.9578691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.9580877Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f59015d99d0>} 2025-05-07T20:33:32.9583126Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.9584644Z context = 2025-05-07T20:33:32.9585075Z 2025-05-07T20:33:32.9585311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.9586084Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.9586935Z module_map=module_map) 2025-05-07T20:33:32.9587470Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.9587976Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:32.9588360Z E ^ 2025-05-07T20:33:32.9589037Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.9589733Z 2025-05-07T20:33:32.9610252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.9611034Z 2025-05-07T20:33:32.9611177Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.9611727Z self=, 2025-05-07T20:33:32.9612264Z T=2048, 2025-05-07T20:33:32.9612501Z D=5120, 2025-05-07T20:33:32.9612741Z scale_ub=1200.0, 2025-05-07T20:33:32.9613017Z contiguous=True, 2025-05-07T20:33:32.9613305Z compiled=False, 2025-05-07T20:33:32.9613578Z ) 2025-05-07T20:33:32.9613991Z self = 2025-05-07T20:33:32.9614670Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:32.9615044Z 2025-05-07T20:33:32.9615145Z @given( 2025-05-07T20:33:32.9615439Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.9615850Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.9616265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.9616703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.9617143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.9617534Z ) 2025-05-07T20:33:32.9618002Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.9618604Z def test_silu_mul_quant( 2025-05-07T20:33:32.9618925Z self, 2025-05-07T20:33:32.9619195Z T: int, 2025-05-07T20:33:32.9619450Z D: int, 2025-05-07T20:33:32.9619738Z scale_ub: Optional[float], 2025-05-07T20:33:32.9620099Z contiguous: bool, 2025-05-07T20:33:32.9620409Z compiled: bool, 2025-05-07T20:33:32.9620712Z ) -> None: 2025-05-07T20:33:32.9621010Z torch.manual_seed(2025) 2025-05-07T20:33:32.9621353Z 2025-05-07T20:33:32.9621729Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9622235Z 2025-05-07T20:33:32.9622494Z x_sign = torch.sign(x) 2025-05-07T20:33:32.9622899Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.9623352Z x = x_sign * x_clamp 2025-05-07T20:33:32.9623888Z x0 = x[:, :D] 
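This CompilationError is architectural: Triton on this runner's GPU (a g5 instance's A10G, SM 8.6) cannot lower the fp8e4nv (e4m3) dtype and offers only fp8e4b15 and fp8e5. A hedged sketch of a capability guard that could skip the FP8 paths on such devices follows; the helper name and the SM 8.9 (Ada/Hopper) threshold are assumptions, not existing FBGEMM code.

    # Sketch only: skip FP8 e4m3 cases on GPUs without hardware FP8 support.
    # _supports_fp8e4m3 is a hypothetical helper; the (8, 9) threshold assumes
    # Triton's fp8e4nv lowering targets Ada (SM 8.9) and newer.
    import unittest

    import torch


    def _supports_fp8e4m3() -> bool:
        if not torch.cuda.is_available():
            return False
        # get_device_capability() returns (major, minor), e.g. (8, 6) on A10G.
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not _supports_fp8e4m3(), "fp8e4nv requires SM >= 8.9")
    class Fp8ActivationTests(unittest.TestCase):
        ...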
[The remaining Hypothesis examples re-print the identical test body and Triton compile chain on every attempt; they are condensed below to parameters, failing call, and error.]

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    [test source elided]
>       y_fp8, y_scale = fn()                      moe/activation_test.py:117
    via silu_mul_quant (gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    [test source elided]
>       y_fp8_ref, y_scale_ref = ref_fn()          moe/activation_test.py:126
    via triton_quantize_fp8_row (triton_gemm/fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    [test source elided]
>       y_fp8, y_scale = fn()                      moe/activation_test.py:117
    via silu_mul_quant (gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
    [test source elided]
>       y_fp8_ref, y_scale_ref = ref_fn()          moe/activation_test.py:126
    via triton_quantize_fp8_row (triton_gemm/fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
    [test source elided]
>       y_fp8, y_scale = fn()                      moe/activation_test.py:117
    via silu_mul_quant (gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.9859118Z 2025-05-07T20:33:32.9859572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.9860128Z 2025-05-07T20:33:32.9860235Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.9860660Z self=, 2025-05-07T20:33:32.9861120Z T=4096, 2025-05-07T20:33:32.9861308Z D=7168, 2025-05-07T20:33:32.9861495Z scale_ub=None, 2025-05-07T20:33:32.9861707Z contiguous=False, 2025-05-07T20:33:32.9861927Z compiled=False, 2025-05-07T20:33:32.9862135Z ) 2025-05-07T20:33:32.9862456Z self = 2025-05-07T20:33:32.9862973Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:32.9863310Z 2025-05-07T20:33:32.9863388Z @given( 2025-05-07T20:33:32.9863614Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.9863927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.9864244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.9864581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.9864914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.9865210Z ) 2025-05-07T20:33:32.9865569Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.9866031Z def test_silu_mul_quant( 2025-05-07T20:33:32.9866271Z self, 2025-05-07T20:33:32.9866455Z T: int, 2025-05-07T20:33:32.9866650Z D: int, 2025-05-07T20:33:32.9866860Z scale_ub: Optional[float], 2025-05-07T20:33:32.9867127Z contiguous: bool, 2025-05-07T20:33:32.9867366Z compiled: bool, 2025-05-07T20:33:32.9867587Z ) -> None: 2025-05-07T20:33:32.9867799Z torch.manual_seed(2025) 2025-05-07T20:33:32.9868039Z 2025-05-07T20:33:32.9868310Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9868667Z 2025-05-07T20:33:32.9868864Z x_sign = torch.sign(x) 2025-05-07T20:33:32.9869155Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.9869473Z x = x_sign * x_clamp 2025-05-07T20:33:32.9869711Z x0 = x[:, :D] 2025-05-07T20:33:32.9870037Z x1 = x[:, D:] 2025-05-07T20:33:32.9870252Z 2025-05-07T20:33:32.9870445Z if contiguous: 2025-05-07T20:33:32.9870676Z x0 = x0.contiguous() 2025-05-07T20:33:32.9870944Z x1 = x1.contiguous() 2025-05-07T20:33:32.9871191Z 2025-05-07T20:33:32.9871388Z if scale_ub is not None: 2025-05-07T20:33:32.9871663Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.9872011Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.9872340Z ) 2025-05-07T20:33:32.9872529Z else: 2025-05-07T20:33:32.9872742Z scale_ub_tensor = None 2025-05-07T20:33:32.9873006Z 2025-05-07T20:33:32.9873288Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.9873618Z op = silu_mul_quant 2025-05-07T20:33:32.9873882Z if compiled: 2025-05-07T20:33:32.9874129Z op = torch.compile(op) 2025-05-07T20:33:32.9874437Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.9874729Z 2025-05-07T20:33:32.9874921Z > y_fp8, y_scale = fn() 2025-05-07T20:33:32.9875096Z 2025-05-07T20:33:32.9875196Z moe/activation_test.py:117: 2025-05-07T20:33:32.9875501Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.9875893Z moe/activation_test.py:115: in fn 2025-05-07T20:33:32.9876175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.9876916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:32.9877668Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:32.9878234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.9879010Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.9879731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.9880301Z kernel = self.compile( 2025-05-07T20:33:32.9880910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.9881615Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.9882033Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.9882275Z 2025-05-07T20:33:32.9882494Z self = 2025-05-07T20:33:32.9884003Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.9885517Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58ffbdedc0>} 2025-05-07T20:33:32.9886994Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.9888105Z context = 2025-05-07T20:33:32.9888411Z 2025-05-07T20:33:32.9888580Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.9889159Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.9889682Z module_map=module_map) 2025-05-07T20:33:32.9890065Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.9890427Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:32.9890695Z E ^ 2025-05-07T20:33:32.9891191Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.9891681Z 2025-05-07T20:33:32.9892131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.9892693Z 2025-05-07T20:33:32.9892796Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.9893229Z self=, 2025-05-07T20:33:32.9893651Z T=128, 2025-05-07T20:33:32.9893833Z D=7168, 2025-05-07T20:33:32.9894029Z scale_ub=None, 2025-05-07T20:33:32.9894251Z contiguous=False, 2025-05-07T20:33:32.9894476Z compiled=True, 2025-05-07T20:33:32.9894683Z ) 2025-05-07T20:33:32.9895008Z self = 2025-05-07T20:33:32.9895617Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:32.9895915Z 2025-05-07T20:33:32.9895998Z @given( 2025-05-07T20:33:32.9896237Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.9896570Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.9896890Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.9897241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.9897589Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.9897955Z ) 2025-05-07T20:33:32.9898316Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.9898786Z def test_silu_mul_quant( 2025-05-07T20:33:32.9899030Z self, 2025-05-07T20:33:32.9899230Z T: int, 2025-05-07T20:33:32.9899430Z D: int, 2025-05-07T20:33:32.9899644Z scale_ub: Optional[float], 2025-05-07T20:33:32.9899923Z contiguous: bool, 2025-05-07T20:33:32.9900177Z compiled: bool, 2025-05-07T20:33:32.9900399Z ) -> None: 2025-05-07T20:33:32.9900618Z torch.manual_seed(2025) 2025-05-07T20:33:32.9900868Z 2025-05-07T20:33:32.9901140Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9901500Z 2025-05-07T20:33:32.9901828Z x_sign = torch.sign(x) 2025-05-07T20:33:32.9902130Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.9902448Z x = x_sign * x_clamp 2025-05-07T20:33:32.9902699Z x0 = x[:, :D] 2025-05-07T20:33:32.9902923Z x1 = x[:, D:] 2025-05-07T20:33:32.9903128Z 2025-05-07T20:33:32.9903320Z if contiguous: 2025-05-07T20:33:32.9903556Z x0 = x0.contiguous() 2025-05-07T20:33:32.9903904Z x1 = x1.contiguous() 2025-05-07T20:33:32.9904160Z 2025-05-07T20:33:32.9904363Z if scale_ub is not None: 2025-05-07T20:33:32.9904643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.9904994Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.9905320Z ) 2025-05-07T20:33:32.9905514Z else: 2025-05-07T20:33:32.9905730Z scale_ub_tensor = None 2025-05-07T20:33:32.9905988Z 2025-05-07T20:33:32.9906219Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.9906546Z op = silu_mul_quant 2025-05-07T20:33:32.9906804Z if compiled: 2025-05-07T20:33:32.9907055Z op = torch.compile(op) 2025-05-07T20:33:32.9907359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.9907643Z 2025-05-07T20:33:32.9907841Z y_fp8, y_scale = fn() 2025-05-07T20:33:32.9908129Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:32.9908435Z 2025-05-07T20:33:32.9908676Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.9909019Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:32.9909326Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:32.9909655Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:32.9910122Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:32.9910449Z 2025-05-07T20:33:32.9910653Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:32.9910857Z 2025-05-07T20:33:32.9910964Z moe/activation_test.py:126: 2025-05-07T20:33:32.9911265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.9911614Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:32.9911951Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:32.9912795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:32.9913623Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:32.9914252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.9914991Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.9915720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:32.9916494Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:32.9917302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:32.9918148Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:32.9918925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:32.9919611Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:32.9920253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:32.9920801Z fn() 2025-05-07T20:33:32.9921339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:32.9921964Z self.fn.run( 2025-05-07T20:33:32.9922504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.9923065Z kernel = self.compile( 2025-05-07T20:33:32.9923641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.9924338Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.9924747Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.9925044Z 2025-05-07T20:33:32.9925255Z self = 2025-05-07T20:33:32.9926428Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.9927937Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58ff752160>} 2025-05-07T20:33:32.9929455Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.9930561Z context = 2025-05-07T20:33:32.9930870Z 2025-05-07T20:33:32.9931046Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.9931607Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.9932097Z module_map=module_map) 2025-05-07T20:33:32.9932475Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.9932843Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:32.9933113Z E ^ 2025-05-07T20:33:32.9933605Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.9934095Z 2025-05-07T20:33:32.9934552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.9935112Z 2025-05-07T20:33:32.9935220Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.9935644Z self=, 2025-05-07T20:33:32.9936067Z T=128, 2025-05-07T20:33:32.9936258Z D=7168, 2025-05-07T20:33:32.9936447Z scale_ub=None, 2025-05-07T20:33:32.9936665Z contiguous=False, 2025-05-07T20:33:32.9936898Z compiled=False, 2025-05-07T20:33:32.9937100Z ) 2025-05-07T20:33:32.9937475Z self = 2025-05-07T20:33:32.9937992Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:32.9938274Z 2025-05-07T20:33:32.9938355Z @given( 2025-05-07T20:33:32.9938581Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.9938906Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.9939263Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.9939640Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.9939981Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.9940279Z ) 2025-05-07T20:33:32.9940629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.9941096Z def test_silu_mul_quant( 2025-05-07T20:33:32.9941340Z self, 2025-05-07T20:33:32.9941529Z T: int, 2025-05-07T20:33:32.9941732Z D: int, 2025-05-07T20:33:32.9941956Z scale_ub: Optional[float], 2025-05-07T20:33:32.9942224Z contiguous: bool, 2025-05-07T20:33:32.9942467Z compiled: bool, 2025-05-07T20:33:32.9942688Z ) -> None: 2025-05-07T20:33:32.9942902Z torch.manual_seed(2025) 2025-05-07T20:33:32.9943141Z 2025-05-07T20:33:32.9943465Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9943821Z 2025-05-07T20:33:32.9944007Z x_sign = torch.sign(x) 2025-05-07T20:33:32.9944303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.9944621Z x = x_sign * x_clamp 2025-05-07T20:33:32.9944862Z x0 = x[:, :D] 2025-05-07T20:33:32.9945078Z x1 = x[:, D:] 2025-05-07T20:33:32.9945288Z 2025-05-07T20:33:32.9945516Z if contiguous: 2025-05-07T20:33:32.9945746Z x0 = x0.contiguous() 2025-05-07T20:33:32.9946008Z x1 = x1.contiguous() 2025-05-07T20:33:32.9946247Z 2025-05-07T20:33:32.9946440Z if scale_ub is not None: 2025-05-07T20:33:32.9946716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.9947052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.9947373Z ) 2025-05-07T20:33:32.9947565Z else: 2025-05-07T20:33:32.9947777Z scale_ub_tensor = None 2025-05-07T20:33:32.9948035Z 2025-05-07T20:33:32.9948261Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.9948587Z op = silu_mul_quant 2025-05-07T20:33:32.9948839Z if compiled: 
2025-05-07T20:33:32.9949088Z                 op = torch.compile(op)
2025-05-07T20:33:32.9949393Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:32.9949673Z 
2025-05-07T20:33:32.9949942Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:32.9950114Z 
2025-05-07T20:33:32.9950218Z moe/activation_test.py:117: 
2025-05-07T20:33:32.9950512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:32.9950861Z moe/activation_test.py:115: in fn
2025-05-07T20:33:32.9951145Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:32.9951888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:32.9952627Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:32.9953194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:32.9953926Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:32.9954628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:32.9955194Z     kernel = self.compile(
2025-05-07T20:33:32.9955767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:32.9956518Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:32.9956927Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:32.9957173Z 
2025-05-07T20:33:32.9957384Z self = <triton.compiler.compiler.ASTSource object>
2025-05-07T20:33:32.9958547Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:32.9960093Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f58ff6dd940>}
2025-05-07T20:33:32.9961553Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:32.9962662Z context = <context object>
2025-05-07T20:33:32.9962971Z 
2025-05-07T20:33:32.9963138Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:32.9963687Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:32.9964213Z                            module_map=module_map)
2025-05-07T20:33:32.9964591Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:32.9964954Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:32.9965219Z E   ^
2025-05-07T20:33:32.9965700Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:32.9966191Z 
2025-05-07T20:33:32.9966636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
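Both failing kernels hit the same root cause: Triton only emits the fp8e4nv (e4m3) dtype on GPUs with compute capability 8.9 or newer (Ada/Hopper); on older architectures it supports only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A hedged sketch of a capability guard that would skip these cases up front; the names here are illustrative, not from the FBGEMM test suite:

# Hypothetical guard (not from the FBGEMM repo): skip fp8 rowwise tests on GPUs
# whose compute capability predates Triton's fp8e4nv (e4m3) support.
import pytest
import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv codegen requires SM 8.9 (Ada) or newer.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

requires_fp8 = pytest.mark.skipif(
    not supports_fp8e4nv(),
    reason="Triton fp8e4nv needs compute capability >= 8.9",
)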
2025-05-07T20:33:32.9967232Z 
[The remaining Hypothesis examples repeat the test source and the fp8e4nv traceback above verbatim; each is collapsed below to its parameters and failing call.]
2025-05-07T20:33:32.9967340Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    [same test body; fails at moe/activation_test.py:117 (fn) compiling _fbgemm_silu_mul_quant with the same fp8e4nv CompilationError]
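For context on the reference path that keeps failing in _kernel_quantize_fp8_row: triton_quantize_fp8_row computes one scale per row, optionally bounded by scale_ub, and casts each row to fp8. A minimal pure-PyTorch sketch of that rowwise scheme, assuming the dequantization convention the test itself checks (y ≈ y_fp8.to(torch.float32) * y_scale[:, None]); the function name is illustrative:

# Minimal sketch of rowwise fp8 quantization, assuming the same semantics as
# fbgemm's triton_quantize_fp8_row: one dequant scale per row, optional upper
# bound on the row max. Not fbgemm's implementation.
from typing import Optional, Tuple
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_fp8_row_sketch(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = x.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
    scale = row_max.clamp(min=1e-12) / FP8_MAX        # per-row dequant scale
    x_fp8 = (x / scale[:, None]).to(torch.float8_e4m3fn)
    return x_fp8, scale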
2025-05-07T20:33:33.0004585Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [same test body; fn() succeeds, then fails at moe/activation_test.py:126 (ref_fn) compiling _kernel_quantize_fp8_row with the same fp8e4nv CompilationError]
2025-05-07T20:33:33.0025718Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [same failure at ref_fn / _kernel_quantize_fp8_row]
2025-05-07T20:33:33.0042962Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [same failure at ref_fn / _kernel_quantize_fp8_row]
2025-05-07T20:33:33.0060100Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [same failure at ref_fn / _kernel_quantize_fp8_row]
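None of these failures needs Hypothesis to reproduce: any one parameter set triggers the same compile error on an unsupported GPU. A standalone repro sketch using the import path shown in the tracebacks above (the parameter choice is one of the failing examples):

# Standalone repro sketch for one example (T=1, D=5120, scale_ub=None); on a
# pre-SM89 GPU this raises the same fp8e4nv CompilationError as the test run.
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 1, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)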
2025-05-07T20:33:33.0077217Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [same failure at ref_fn / _kernel_quantize_fp8_row]
2025-05-07T20:33:33.0096162Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    [same test body; fails at moe/activation_test.py:117 (fn), entering the kernel through torch/_dynamo/eval_frame.py:678, compiling _fbgemm_silu_mul_quant]
2025-05-07T20:33:33.0109616Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
    [same failure at ref_fn / _kernel_quantize_fp8_row]
2025-05-07T20:33:33.0130831Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
    [same failure at fn / _fbgemm_silu_mul_quant]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0144301Z 2025-05-07T20:33:33.0144750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0144755Z 2025-05-07T20:33:33.0144856Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0145126Z self=, 2025-05-07T20:33:33.0145202Z T=128, 2025-05-07T20:33:33.0145278Z D=5120, 2025-05-07T20:33:33.0145358Z scale_ub=None, 2025-05-07T20:33:33.0145451Z contiguous=False, 2025-05-07T20:33:33.0145534Z compiled=True, 2025-05-07T20:33:33.0145605Z ) 2025-05-07T20:33:33.0145834Z self = 2025-05-07T20:33:33.0146006Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:33.0146014Z 2025-05-07T20:33:33.0146087Z @given( 2025-05-07T20:33:33.0146211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0146306Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0146423Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0146541Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0146656Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0146737Z ) 2025-05-07T20:33:33.0146998Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0147090Z def test_silu_mul_quant( 2025-05-07T20:33:33.0147169Z self, 2025-05-07T20:33:33.0147242Z T: int, 2025-05-07T20:33:33.0147315Z D: int, 2025-05-07T20:33:33.0147413Z scale_ub: Optional[float], 2025-05-07T20:33:33.0147501Z contiguous: bool, 2025-05-07T20:33:33.0147585Z compiled: bool, 2025-05-07T20:33:33.0147664Z ) -> None: 2025-05-07T20:33:33.0147759Z torch.manual_seed(2025) 2025-05-07T20:33:33.0147832Z 2025-05-07T20:33:33.0148007Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0148084Z 2025-05-07T20:33:33.0148178Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0148301Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0148388Z x = x_sign * x_clamp 2025-05-07T20:33:33.0148473Z x0 = x[:, :D] 2025-05-07T20:33:33.0148554Z x1 = x[:, D:] 2025-05-07T20:33:33.0148626Z 2025-05-07T20:33:33.0148708Z if contiguous: 2025-05-07T20:33:33.0148795Z x0 = x0.contiguous() 2025-05-07T20:33:33.0148931Z x1 = x1.contiguous() 2025-05-07T20:33:33.0149007Z 2025-05-07T20:33:33.0149096Z if scale_ub is not None: 2025-05-07T20:33:33.0149200Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0149338Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0149410Z ) 2025-05-07T20:33:33.0149485Z else: 2025-05-07T20:33:33.0149579Z scale_ub_tensor = None 2025-05-07T20:33:33.0149650Z 2025-05-07T20:33:33.0149870Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0150004Z op = silu_mul_quant 2025-05-07T20:33:33.0150086Z if compiled: 2025-05-07T20:33:33.0150187Z op = torch.compile(op) 2025-05-07T20:33:33.0150291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0150364Z 2025-05-07T20:33:33.0150456Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0150461Z 2025-05-07T20:33:33.0150556Z moe/activation_test.py:117: 2025-05-07T20:33:33.0150694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0150796Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0150894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0151329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0151423Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0151961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0152061Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0152443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0152715Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0153080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0153174Z kernel = self.compile( 2025-05-07T20:33:33.0153582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0153756Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0153893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0153898Z 2025-05-07T20:33:33.0154110Z self = 2025-05-07T20:33:33.0154956Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0155514Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fe4fc040>} 2025-05-07T20:33:33.0156324Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0156527Z context = 2025-05-07T20:33:33.0156531Z 2025-05-07T20:33:33.0156700Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0156975Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0157088Z module_map=module_map) 2025-05-07T20:33:33.0157249Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0157347Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0157427Z E ^ 2025-05-07T20:33:33.0157848Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.0158299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:33.0158404Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): same test body as above; fails at moe/activation_test.py:117 in fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant with the identical CompilationError ("type fp8e4nv not supported in this architecture").
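Every example fails the same way, at Triton compile time rather than in the test logic. Triton's fp8e4nv is the FP8 E4M3 format, and the NVIDIA backend emits it only on compute capability 8.9 or newer (Ada/Hopper); on older parts only 'fp8e4b15' and 'fp8e5' exist, which is exactly what the ValueError reports. A minimal sketch of a capability probe (a hypothetical helper, not part of the test suite):

import torch

def supports_fp8_e4m3() -> bool:
    # Triton lowers fp8e4nv (FP8 E4M3) only on NVIDIA SM 8.9+ (e.g. L4, H100).
    # Pre-8.9 GPUs report a lower capability tuple and land in the
    # "supported fp8 dtypes are ('fp8e4b15', 'fp8e5')" error seen above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)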
2025-05-07T20:33:33.0171382Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False): same test body; fails in fn() with the identical CompilationError.
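For context, the op under test fuses y = x0 * sigmoid(x0) * x1 with row-wise FP8 quantization, producing one scale per row that the test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None]. An eager-mode sketch of that reference math, assuming torch.float8_e4m3fn is available and using 448.0 as the E4M3 max (this is illustrative, not the FBGEMM implementation):

import torch

def silu_mul_quant_ref(x0, x1, scale_ub=None):
    # y = silu(x0) * x1 computed in fp32, then one FP8 scale per row.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        # scale_ub arrives as a 1-element float32 tensor, as in the test.
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / 448.0  # per-row dequantization scale (E4M3 max = 448)
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale  # y is approximately y_fp8.float() * scale[:, None]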
2025-05-07T20:33:33.0184706Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): same test body; fails in fn() with the identical CompilationError.
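The error text lists 'fp8e4b15' and 'fp8e5' as the formats this GPU does accept. For illustration only, a tiny Triton kernel casting to tl.float8e5 (E5M2) would compile where the fp8e4nv store in _fbgemm_silu_mul_quant does not; this is a sketch under that assumption, not the FBGEMM kernel:

import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e5(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
    # E5M2 ('fp8e5') is accepted on pre-SM 8.9 GPUs, unlike fp8e4nv (E4M3).
    # y_ptr is assumed to point at a torch.float8_e5m2 tensor.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e5), mask=mask)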
2025-05-07T20:33:33.0197904Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): same test body; fails in fn() (through torch/_dynamo/eval_frame.py, since compiled=True) with the identical CompilationError.
2025-05-07T20:33:33.0211427Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): same test body; fails in fn() with the identical error:
2025-05-07T20:33:33.0224339Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0224344Z 2025-05-07T20:33:33.0224791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0224800Z 2025-05-07T20:33:33.0224902Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0225134Z self=, 2025-05-07T20:33:33.0225211Z T=1, 2025-05-07T20:33:33.0225287Z D=7168, 2025-05-07T20:33:33.0225406Z scale_ub=None, 2025-05-07T20:33:33.0225498Z contiguous=False, 2025-05-07T20:33:33.0225580Z compiled=True, 2025-05-07T20:33:33.0225653Z ) 2025-05-07T20:33:33.0225879Z self = 2025-05-07T20:33:33.0226048Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:33.0226053Z 2025-05-07T20:33:33.0226131Z @given( 2025-05-07T20:33:33.0226257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0226354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0226472Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0226585Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0226697Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0226835Z ) 2025-05-07T20:33:33.0227096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0227187Z def test_silu_mul_quant( 2025-05-07T20:33:33.0227265Z self, 2025-05-07T20:33:33.0227340Z T: int, 2025-05-07T20:33:33.0227415Z D: int, 2025-05-07T20:33:33.0227514Z scale_ub: Optional[float], 2025-05-07T20:33:33.0227605Z contiguous: bool, 2025-05-07T20:33:33.0227736Z compiled: bool, 2025-05-07T20:33:33.0227814Z ) -> None: 2025-05-07T20:33:33.0227907Z torch.manual_seed(2025) 2025-05-07T20:33:33.0227979Z 2025-05-07T20:33:33.0228151Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0228222Z 2025-05-07T20:33:33.0228314Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0228438Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0228524Z x = x_sign * x_clamp 2025-05-07T20:33:33.0228608Z x0 = x[:, :D] 2025-05-07T20:33:33.0228687Z x1 = x[:, D:] 2025-05-07T20:33:33.0228761Z 2025-05-07T20:33:33.0228852Z if contiguous: 2025-05-07T20:33:33.0228943Z x0 = x0.contiguous() 2025-05-07T20:33:33.0229033Z x1 = x1.contiguous() 2025-05-07T20:33:33.0229108Z 2025-05-07T20:33:33.0229197Z if scale_ub is not None: 2025-05-07T20:33:33.0229304Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0229441Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0229516Z ) 2025-05-07T20:33:33.0229593Z else: 2025-05-07T20:33:33.0229687Z scale_ub_tensor = None 2025-05-07T20:33:33.0229834Z 2025-05-07T20:33:33.0229968Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0230057Z op = silu_mul_quant 2025-05-07T20:33:33.0230139Z if compiled: 2025-05-07T20:33:33.0230241Z op = torch.compile(op) 2025-05-07T20:33:33.0230349Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0230417Z 2025-05-07T20:33:33.0230511Z y_fp8, y_scale = fn() 2025-05-07T20:33:33.0230630Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:33.0230711Z 2025-05-07T20:33:33.0230847Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0230947Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:33.0231050Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:33.0231173Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:33.0231312Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:33.0231436Z 2025-05-07T20:33:33.0231535Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:33.0231540Z 2025-05-07T20:33:33.0231636Z moe/activation_test.py:126: 2025-05-07T20:33:33.0231769Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0231874Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:33.0232011Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:33.0232619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:33.0232761Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:33.0233147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0233381Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0233777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:33.0234044Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:33.0234472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:33.0234784Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:33.0235192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:33.0235363Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:33.0235733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:33.0235848Z fn() 2025-05-07T20:33:33.0236281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:33.0236367Z self.fn.run( 2025-05-07T20:33:33.0236726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0236824Z kernel = self.compile( 2025-05-07T20:33:33.0237239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0237415Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0237546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0237553Z 2025-05-07T20:33:33.0237764Z self = 2025-05-07T20:33:33.0238619Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0239225Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58fe034160>}
2025-05-07T20:33:33.0240046Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:33.0240239Z context =
2025-05-07T20:33:33.0240415Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:33.0240695Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:33.0240804Z module_map=module_map)
2025-05-07T20:33:33.0240969Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:33.0241070Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:33.0241188Z E ^
2025-05-07T20:33:33.0241576Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.0242024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:33.0242133Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same test body; fails in fn() (through torch/_dynamo/eval_frame.py, since compiled=True) with the identical CompilationError.
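Because Hypothesis keeps drawing and shrinking examples after a failure, every drawn example re-triggers the same compile error, which is what inflates this log. Gating the test on device capability would skip the whole class up front; a hypothetical guard (class name and message invented for illustration):

import unittest
import torch

_HAS_FP8_E4M3 = (
    torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)
)

@unittest.skipIf(not _HAS_FP8_E4M3, "fp8e4nv (FP8 E4M3) needs SM 8.9+")
class SiluMulQuantTests(unittest.TestCase):  # hypothetical class name
    ...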
2025-05-07T20:33:33.0260949Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): same test body; fails in fn() with the identical CompilationError.
2025-05-07T20:33:33.0274285Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same test body; fails in fn() with the identical error:
2025-05-07T20:33:33.0287642Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0287649Z 2025-05-07T20:33:33.0288097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0288104Z 2025-05-07T20:33:33.0288210Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0288439Z self=, 2025-05-07T20:33:33.0288520Z T=2048, 2025-05-07T20:33:33.0288594Z D=7168, 2025-05-07T20:33:33.0288738Z scale_ub=1200.0, 2025-05-07T20:33:33.0288829Z contiguous=False, 2025-05-07T20:33:33.0288914Z compiled=True, 2025-05-07T20:33:33.0288991Z ) 2025-05-07T20:33:33.0289218Z self = 2025-05-07T20:33:33.0289401Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:33.0289405Z 2025-05-07T20:33:33.0289482Z @given( 2025-05-07T20:33:33.0289605Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0289766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0289884Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0290000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0290114Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0290192Z ) 2025-05-07T20:33:33.0290451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0290546Z def test_silu_mul_quant( 2025-05-07T20:33:33.0290627Z self, 2025-05-07T20:33:33.0290706Z T: int, 2025-05-07T20:33:33.0290783Z D: int, 2025-05-07T20:33:33.0290885Z scale_ub: Optional[float], 2025-05-07T20:33:33.0290977Z contiguous: bool, 2025-05-07T20:33:33.0291067Z compiled: bool, 2025-05-07T20:33:33.0291208Z ) -> None: 2025-05-07T20:33:33.0291304Z torch.manual_seed(2025) 2025-05-07T20:33:33.0291382Z 2025-05-07T20:33:33.0291553Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0291629Z 2025-05-07T20:33:33.0291723Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0291847Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0291935Z x = x_sign * x_clamp 2025-05-07T20:33:33.0292065Z x0 = x[:, :D] 2025-05-07T20:33:33.0292144Z x1 = x[:, D:] 2025-05-07T20:33:33.0292215Z 2025-05-07T20:33:33.0292300Z if contiguous: 2025-05-07T20:33:33.0292395Z x0 = x0.contiguous() 2025-05-07T20:33:33.0292487Z x1 = x1.contiguous() 2025-05-07T20:33:33.0292566Z 2025-05-07T20:33:33.0292658Z if scale_ub is not None: 2025-05-07T20:33:33.0292766Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0292904Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0292980Z ) 2025-05-07T20:33:33.0293058Z else: 2025-05-07T20:33:33.0293151Z scale_ub_tensor = None 2025-05-07T20:33:33.0293226Z 2025-05-07T20:33:33.0293363Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0293456Z op = silu_mul_quant 2025-05-07T20:33:33.0293541Z if compiled: 2025-05-07T20:33:33.0293642Z op = torch.compile(op) 2025-05-07T20:33:33.0293752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0293829Z 2025-05-07T20:33:33.0293920Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0293925Z 2025-05-07T20:33:33.0294023Z moe/activation_test.py:117: 2025-05-07T20:33:33.0294157Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0294258Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0294358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0294757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0294852Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0295391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0295493Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0295876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0296115Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0296548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0296644Z kernel = self.compile( 2025-05-07T20:33:33.0297057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0297236Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0297373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0297377Z 2025-05-07T20:33:33.0297635Z self = 2025-05-07T20:33:33.0298481Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0299033Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fe22aee0>} 2025-05-07T20:33:33.0299848Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0300086Z context = 2025-05-07T20:33:33.0300091Z 2025-05-07T20:33:33.0300261Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0300542Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0300649Z module_map=module_map) 2025-05-07T20:33:33.0300813Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0300956Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0301035Z E ^ 2025-05-07T20:33:33.0301418Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0301423Z 2025-05-07T20:33:33.0301870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0301875Z 2025-05-07T20:33:33.0301975Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0302210Z self=, 2025-05-07T20:33:33.0302288Z T=1, 2025-05-07T20:33:33.0302364Z D=5120, 2025-05-07T20:33:33.0302455Z scale_ub=None, 2025-05-07T20:33:33.0302541Z contiguous=False, 2025-05-07T20:33:33.0302624Z compiled=False, 2025-05-07T20:33:33.0302698Z ) 2025-05-07T20:33:33.0302921Z self = 2025-05-07T20:33:33.0303094Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:33.0303098Z 2025-05-07T20:33:33.0303177Z @given( 2025-05-07T20:33:33.0303299Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0303397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0303516Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0303633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0303750Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0303824Z ) 2025-05-07T20:33:33.0304080Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0304177Z def test_silu_mul_quant( 2025-05-07T20:33:33.0304258Z self, 2025-05-07T20:33:33.0304337Z T: int, 2025-05-07T20:33:33.0304417Z D: int, 2025-05-07T20:33:33.0304516Z scale_ub: Optional[float], 2025-05-07T20:33:33.0304609Z contiguous: bool, 2025-05-07T20:33:33.0304698Z compiled: bool, 2025-05-07T20:33:33.0304775Z ) -> None: 2025-05-07T20:33:33.0304873Z torch.manual_seed(2025) 2025-05-07T20:33:33.0304951Z 2025-05-07T20:33:33.0305168Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0305249Z 2025-05-07T20:33:33.0305342Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0305468Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0305563Z x = x_sign * x_clamp 2025-05-07T20:33:33.0305649Z x0 = x[:, :D] 2025-05-07T20:33:33.0305731Z x1 = x[:, D:] 2025-05-07T20:33:33.0305810Z 2025-05-07T20:33:33.0305894Z if contiguous: 2025-05-07T20:33:33.0305988Z x0 = x0.contiguous() 2025-05-07T20:33:33.0306126Z x1 = x1.contiguous() 2025-05-07T20:33:33.0306199Z 2025-05-07T20:33:33.0306291Z if scale_ub is not None: 2025-05-07T20:33:33.0306399Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0306537Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0306615Z ) 2025-05-07T20:33:33.0306688Z else: 2025-05-07T20:33:33.0306784Z scale_ub_tensor = None 2025-05-07T20:33:33.0306861Z 2025-05-07T20:33:33.0306990Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0307081Z op = silu_mul_quant 2025-05-07T20:33:33.0307169Z if compiled: 2025-05-07T20:33:33.0307268Z op = torch.compile(op) 2025-05-07T20:33:33.0307414Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0307496Z 2025-05-07T20:33:33.0307588Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0307592Z 2025-05-07T20:33:33.0307692Z moe/activation_test.py:117: 2025-05-07T20:33:33.0307825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0307926Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0308029Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0308615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0308719Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0309112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0309346Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0309715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0309888Z kernel = self.compile( 2025-05-07T20:33:33.0310299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0310486Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0310615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0310622Z 2025-05-07T20:33:33.0310834Z self = 2025-05-07T20:33:33.0311685Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0312233Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fe1705e0>} 2025-05-07T20:33:33.0313053Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0313252Z context = 2025-05-07T20:33:33.0313256Z 2025-05-07T20:33:33.0313431Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0313708Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0313860Z module_map=module_map) 2025-05-07T20:33:33.0314027Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0314124Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0314199Z E ^ 2025-05-07T20:33:33.0314585Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0314589Z 2025-05-07T20:33:33.0315034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0315079Z 2025-05-07T20:33:33.0315185Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0315413Z self=, 2025-05-07T20:33:33.0315493Z T=4096, 2025-05-07T20:33:33.0315570Z D=7168, 2025-05-07T20:33:33.0315653Z scale_ub=1200.0, 2025-05-07T20:33:33.0315742Z contiguous=False, 2025-05-07T20:33:33.0315828Z compiled=False, 2025-05-07T20:33:33.0315909Z ) 2025-05-07T20:33:33.0316137Z self = 2025-05-07T20:33:33.0316321Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:33.0316326Z 2025-05-07T20:33:33.0316400Z @given( 2025-05-07T20:33:33.0316562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0316663Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0316778Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0316907Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0317022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0317099Z ) 2025-05-07T20:33:33.0317360Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0317491Z def test_silu_mul_quant( 2025-05-07T20:33:33.0317570Z self, 2025-05-07T20:33:33.0317645Z T: int, 2025-05-07T20:33:33.0317726Z D: int, 2025-05-07T20:33:33.0317823Z scale_ub: Optional[float], 2025-05-07T20:33:33.0317909Z contiguous: bool, 2025-05-07T20:33:33.0317994Z compiled: bool, 2025-05-07T20:33:33.0318074Z ) -> None: 2025-05-07T20:33:33.0318164Z torch.manual_seed(2025) 2025-05-07T20:33:33.0318233Z 2025-05-07T20:33:33.0318410Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0318485Z 2025-05-07T20:33:33.0318577Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0318707Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0318794Z x = x_sign * x_clamp 2025-05-07T20:33:33.0318882Z x0 = x[:, :D] 2025-05-07T20:33:33.0318958Z x1 = x[:, D:] 2025-05-07T20:33:33.0319033Z 2025-05-07T20:33:33.0319120Z if contiguous: 2025-05-07T20:33:33.0319210Z x0 = x0.contiguous() 2025-05-07T20:33:33.0319295Z x1 = x1.contiguous() 2025-05-07T20:33:33.0319375Z 2025-05-07T20:33:33.0319466Z if scale_ub is not None: 2025-05-07T20:33:33.0319570Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0319703Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0319786Z ) 2025-05-07T20:33:33.0319861Z else: 2025-05-07T20:33:33.0319957Z scale_ub_tensor = None 2025-05-07T20:33:33.0320034Z 2025-05-07T20:33:33.0320162Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0320252Z op = silu_mul_quant 2025-05-07T20:33:33.0320337Z if compiled: 2025-05-07T20:33:33.0320435Z op = torch.compile(op) 2025-05-07T20:33:33.0320541Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0320613Z 2025-05-07T20:33:33.0320705Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0320709Z 2025-05-07T20:33:33.0320810Z moe/activation_test.py:117: 2025-05-07T20:33:33.0320984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0321083Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0321183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0321722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0321826Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0322208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0322439Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0322854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0322948Z kernel = self.compile( 2025-05-07T20:33:33.0323356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0323536Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0323664Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0323668Z 2025-05-07T20:33:33.0323877Z self = 2025-05-07T20:33:33.0324764Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0325315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fdab01f0>} 2025-05-07T20:33:33.0326125Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0326387Z context = 2025-05-07T20:33:33.0326391Z 2025-05-07T20:33:33.0326560Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0326833Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0326942Z module_map=module_map) 2025-05-07T20:33:33.0327106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0327202Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0327285Z E ^ 2025-05-07T20:33:33.0327662Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0327667Z 2025-05-07T20:33:33.0328110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0328118Z 2025-05-07T20:33:33.0328216Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0328447Z self=, 2025-05-07T20:33:33.0328523Z T=16384, 2025-05-07T20:33:33.0328598Z D=7168, 2025-05-07T20:33:33.0328679Z scale_ub=None, 2025-05-07T20:33:33.0328764Z contiguous=True, 2025-05-07T20:33:33.0328845Z compiled=True, 2025-05-07T20:33:33.0328938Z ) 2025-05-07T20:33:33.0329196Z self = 2025-05-07T20:33:33.0329372Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:33.0329379Z 2025-05-07T20:33:33.0329456Z @given( 2025-05-07T20:33:33.0329575Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0329673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0329791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0329905Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0330060Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0330140Z ) 2025-05-07T20:33:33.0330397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0330485Z def test_silu_mul_quant( 2025-05-07T20:33:33.0330561Z self, 2025-05-07T20:33:33.0330634Z T: int, 2025-05-07T20:33:33.0330709Z D: int, 2025-05-07T20:33:33.0330809Z scale_ub: Optional[float], 2025-05-07T20:33:33.0330895Z contiguous: bool, 2025-05-07T20:33:33.0330984Z compiled: bool, 2025-05-07T20:33:33.0331102Z ) -> None: 2025-05-07T20:33:33.0331194Z torch.manual_seed(2025) 2025-05-07T20:33:33.0331271Z 2025-05-07T20:33:33.0331442Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0331518Z 2025-05-07T20:33:33.0331609Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0331730Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0331817Z x = x_sign * x_clamp 2025-05-07T20:33:33.0331901Z x0 = x[:, :D] 2025-05-07T20:33:33.0331980Z x1 = x[:, D:] 2025-05-07T20:33:33.0332052Z 2025-05-07T20:33:33.0332136Z if contiguous: 2025-05-07T20:33:33.0332223Z x0 = x0.contiguous() 2025-05-07T20:33:33.0332316Z x1 = x1.contiguous() 2025-05-07T20:33:33.0332387Z 2025-05-07T20:33:33.0332519Z if scale_ub is not None: 2025-05-07T20:33:33.0332626Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0332761Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0332838Z ) 2025-05-07T20:33:33.0332915Z else: 2025-05-07T20:33:33.0333008Z scale_ub_tensor = None 2025-05-07T20:33:33.0333080Z 2025-05-07T20:33:33.0333209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0333342Z op = silu_mul_quant 2025-05-07T20:33:33.0333423Z if compiled: 2025-05-07T20:33:33.0333524Z op = torch.compile(op) 2025-05-07T20:33:33.0333630Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0333704Z 2025-05-07T20:33:33.0333793Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0333797Z 2025-05-07T20:33:33.0333891Z moe/activation_test.py:117: 2025-05-07T20:33:33.0334022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0334124Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0334223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0334622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0334714Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0335249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0335354Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0335736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0335969Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0336329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0336421Z kernel = self.compile( 2025-05-07T20:33:33.0336834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0337010Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0337142Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0337147Z 2025-05-07T20:33:33.0337355Z self = 2025-05-07T20:33:33.0338246Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0338795Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fdab0ee0>} 2025-05-07T20:33:33.0339660Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0339857Z context = 2025-05-07T20:33:33.0339899Z 2025-05-07T20:33:33.0340066Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0340339Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0340448Z module_map=module_map) 2025-05-07T20:33:33.0340612Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0340712Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0340787Z E ^ 2025-05-07T20:33:33.0341167Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0341171Z 2025-05-07T20:33:33.0341657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0341662Z 2025-05-07T20:33:33.0341761Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0341998Z self=, 2025-05-07T20:33:33.0342073Z T=4096, 2025-05-07T20:33:33.0342148Z D=5120, 2025-05-07T20:33:33.0342229Z scale_ub=None, 2025-05-07T20:33:33.0342357Z contiguous=False, 2025-05-07T20:33:33.0342441Z compiled=True, 2025-05-07T20:33:33.0342516Z ) 2025-05-07T20:33:33.0342740Z self = 2025-05-07T20:33:33.0342917Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:33.0342921Z 2025-05-07T20:33:33.0342996Z @given( 2025-05-07T20:33:33.0343114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0343218Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0343334Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0343448Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0343564Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0343637Z ) 2025-05-07T20:33:33.0343893Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0343985Z def test_silu_mul_quant( 2025-05-07T20:33:33.0344062Z self, 2025-05-07T20:33:33.0344139Z T: int, 2025-05-07T20:33:33.0344216Z D: int, 2025-05-07T20:33:33.0344313Z scale_ub: Optional[float], 2025-05-07T20:33:33.0344401Z contiguous: bool, 2025-05-07T20:33:33.0344496Z compiled: bool, 2025-05-07T20:33:33.0344573Z ) -> None: 2025-05-07T20:33:33.0344669Z torch.manual_seed(2025) 2025-05-07T20:33:33.0344738Z 2025-05-07T20:33:33.0344905Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0344979Z 2025-05-07T20:33:33.0345072Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0345197Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0345285Z x = x_sign * x_clamp 2025-05-07T20:33:33.0345367Z x0 = x[:, :D] 2025-05-07T20:33:33.0345446Z x1 = x[:, D:] 2025-05-07T20:33:33.0345523Z 2025-05-07T20:33:33.0345605Z if contiguous: 2025-05-07T20:33:33.0345694Z x0 = x0.contiguous() 2025-05-07T20:33:33.0345787Z x1 = x1.contiguous() 2025-05-07T20:33:33.0345856Z 2025-05-07T20:33:33.0345945Z if scale_ub is not None: 2025-05-07T20:33:33.0346055Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0346240Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0346318Z ) 2025-05-07T20:33:33.0346392Z else: 2025-05-07T20:33:33.0346484Z scale_ub_tensor = None 2025-05-07T20:33:33.0346560Z 2025-05-07T20:33:33.0346686Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0346774Z op = silu_mul_quant 2025-05-07T20:33:33.0346858Z if compiled: 2025-05-07T20:33:33.0346957Z op = torch.compile(op) 2025-05-07T20:33:33.0347101Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0347173Z 2025-05-07T20:33:33.0347263Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0347267Z 2025-05-07T20:33:33.0347364Z moe/activation_test.py:117: 2025-05-07T20:33:33.0347498Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0347596Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0347697Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0348091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0348184Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0348765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0348863Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0349244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0349477Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0349904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0350044Z kernel = self.compile( 2025-05-07T20:33:33.0350450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0350629Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0350760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0350765Z 2025-05-07T20:33:33.0350971Z self = 2025-05-07T20:33:33.0351823Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0352368Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fe13a940>} 2025-05-07T20:33:33.0353188Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0353381Z context = 2025-05-07T20:33:33.0353386Z 2025-05-07T20:33:33.0353552Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0353833Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0353937Z module_map=module_map) 2025-05-07T20:33:33.0354100Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0354202Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0354277Z E ^ 2025-05-07T20:33:33.0354659Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0354667Z 2025-05-07T20:33:33.0355111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0355115Z 2025-05-07T20:33:33.0355260Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0355497Z self=, 2025-05-07T20:33:33.0355572Z T=4096, 2025-05-07T20:33:33.0355650Z D=5120, 2025-05-07T20:33:33.0355729Z scale_ub=1200.0, 2025-05-07T20:33:33.0355810Z contiguous=False, 2025-05-07T20:33:33.0355900Z compiled=False, 2025-05-07T20:33:33.0355974Z ) 2025-05-07T20:33:33.0356196Z self = 2025-05-07T20:33:33.0356419Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:33.0356424Z 2025-05-07T20:33:33.0356496Z @given( 2025-05-07T20:33:33.0356614Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0356716Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0356831Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0356953Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0357066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0357140Z ) 2025-05-07T20:33:33.0357399Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0357490Z def test_silu_mul_quant( 2025-05-07T20:33:33.0357565Z self, 2025-05-07T20:33:33.0357705Z T: int, 2025-05-07T20:33:33.0357780Z D: int, 2025-05-07T20:33:33.0357877Z scale_ub: Optional[float], 2025-05-07T20:33:33.0357964Z contiguous: bool, 2025-05-07T20:33:33.0358051Z compiled: bool, 2025-05-07T20:33:33.0358128Z ) -> None: 2025-05-07T20:33:33.0358222Z torch.manual_seed(2025) 2025-05-07T20:33:33.0358294Z 2025-05-07T20:33:33.0358466Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0358579Z 2025-05-07T20:33:33.0358669Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0358796Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0358883Z x = x_sign * x_clamp 2025-05-07T20:33:33.0358962Z x0 = x[:, :D] 2025-05-07T20:33:33.0359044Z x1 = x[:, D:] 2025-05-07T20:33:33.0359113Z 2025-05-07T20:33:33.0359196Z if contiguous: 2025-05-07T20:33:33.0359288Z x0 = x0.contiguous() 2025-05-07T20:33:33.0359375Z x1 = x1.contiguous() 2025-05-07T20:33:33.0359446Z 2025-05-07T20:33:33.0359539Z if scale_ub is not None: 2025-05-07T20:33:33.0359641Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0359779Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0359852Z ) 2025-05-07T20:33:33.0359925Z else: 2025-05-07T20:33:33.0360021Z scale_ub_tensor = None 2025-05-07T20:33:33.0360095Z 2025-05-07T20:33:33.0360222Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0360311Z op = silu_mul_quant 2025-05-07T20:33:33.0360395Z if compiled: 2025-05-07T20:33:33.0360495Z op = torch.compile(op) 2025-05-07T20:33:33.0360605Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0360677Z 2025-05-07T20:33:33.0360765Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0360769Z 2025-05-07T20:33:33.0360867Z moe/activation_test.py:117: 2025-05-07T20:33:33.0360999Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0361100Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0361199Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0361739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0361838Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0362224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0362456Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0362865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0362958Z kernel = self.compile( 2025-05-07T20:33:33.0363367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0363543Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0363670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0363714Z 2025-05-07T20:33:33.0363924Z self = 2025-05-07T20:33:33.0364771Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0365325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fdbd93a0>} 2025-05-07T20:33:33.0366175Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0366372Z context = 2025-05-07T20:33:33.0366379Z 2025-05-07T20:33:33.0366547Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0366823Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0366928Z module_map=module_map) 2025-05-07T20:33:33.0367128Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0367225Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0367302Z E ^ 2025-05-07T20:33:33.0367686Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0367691Z 2025-05-07T20:33:33.0368137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0368142Z 2025-05-07T20:33:33.0368244Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0368474Z self=, 2025-05-07T20:33:33.0368551Z T=4096, 2025-05-07T20:33:33.0368624Z D=5120, 2025-05-07T20:33:33.0368711Z scale_ub=1200.0, 2025-05-07T20:33:33.0368815Z contiguous=False, 2025-05-07T20:33:33.0368905Z compiled=True, 2025-05-07T20:33:33.0368991Z ) 2025-05-07T20:33:33.0369231Z self = 2025-05-07T20:33:33.0369409Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:33.0369414Z 2025-05-07T20:33:33.0369493Z @given( 2025-05-07T20:33:33.0369610Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0369705Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0369823Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0369937Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0370051Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0370125Z ) 2025-05-07T20:33:33.0370381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0370475Z def test_silu_mul_quant( 2025-05-07T20:33:33.0370549Z self, 2025-05-07T20:33:33.0370625Z T: int, 2025-05-07T20:33:33.0370701Z D: int, 2025-05-07T20:33:33.0370796Z scale_ub: Optional[float], 2025-05-07T20:33:33.0370886Z contiguous: bool, 2025-05-07T20:33:33.0370973Z compiled: bool, 2025-05-07T20:33:33.0371050Z ) -> None: 2025-05-07T20:33:33.0371187Z torch.manual_seed(2025) 2025-05-07T20:33:33.0371263Z 2025-05-07T20:33:33.0371431Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0371503Z 2025-05-07T20:33:33.0371596Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0371720Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0371810Z x = x_sign * x_clamp 2025-05-07T20:33:33.0371893Z x0 = x[:, :D] 2025-05-07T20:33:33.0371973Z x1 = x[:, D:] 2025-05-07T20:33:33.0372046Z 2025-05-07T20:33:33.0372171Z if contiguous: 2025-05-07T20:33:33.0372262Z x0 = x0.contiguous() 2025-05-07T20:33:33.0372350Z x1 = x1.contiguous() 2025-05-07T20:33:33.0372424Z 2025-05-07T20:33:33.0372512Z if scale_ub is not None: 2025-05-07T20:33:33.0372620Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0372753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0372830Z ) 2025-05-07T20:33:33.0372912Z else: 2025-05-07T20:33:33.0373004Z scale_ub_tensor = None 2025-05-07T20:33:33.0373077Z 2025-05-07T20:33:33.0373207Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0373296Z op = silu_mul_quant 2025-05-07T20:33:33.0373382Z if compiled: 2025-05-07T20:33:33.0373522Z op = torch.compile(op) 2025-05-07T20:33:33.0373628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0373702Z 2025-05-07T20:33:33.0373793Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0373800Z 2025-05-07T20:33:33.0373896Z moe/activation_test.py:117: 2025-05-07T20:33:33.0374030Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0374130Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0374269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0374661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0374754Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0375292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0375387Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0375769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0376005Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0376366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0376458Z kernel = self.compile( 2025-05-07T20:33:33.0376867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0377046Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0377178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0377183Z 2025-05-07T20:33:33.0377393Z self = 2025-05-07T20:33:33.0378242Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0378792Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fdbd9280>} 2025-05-07T20:33:33.0379653Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0379852Z context = 2025-05-07T20:33:33.0379898Z 2025-05-07T20:33:33.0380064Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0380340Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0380445Z module_map=module_map) 2025-05-07T20:33:33.0380607Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0380706Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0380780Z E ^ 2025-05-07T20:33:33.0381156Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0381199Z 2025-05-07T20:33:33.0381645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0381652Z 2025-05-07T20:33:33.0381751Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0381986Z self=, 2025-05-07T20:33:33.0382061Z T=2048, 2025-05-07T20:33:33.0382137Z D=7168, 2025-05-07T20:33:33.0382225Z scale_ub=1200.0, 2025-05-07T20:33:33.0382309Z contiguous=False, 2025-05-07T20:33:33.0382394Z compiled=False, 2025-05-07T20:33:33.0382473Z ) 2025-05-07T20:33:33.0382909Z self = 2025-05-07T20:33:33.0383147Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:33.0383156Z 2025-05-07T20:33:33.0383238Z @given( 2025-05-07T20:33:33.0383357Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0383455Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0387104Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0387353Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0387470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0387550Z ) 2025-05-07T20:33:33.0387816Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0387915Z def test_silu_mul_quant( 2025-05-07T20:33:33.0387993Z self, 2025-05-07T20:33:33.0388069Z T: int, 2025-05-07T20:33:33.0388146Z D: int, 2025-05-07T20:33:33.0388246Z scale_ub: Optional[float], 2025-05-07T20:33:33.0388341Z contiguous: bool, 2025-05-07T20:33:33.0388436Z compiled: bool, 2025-05-07T20:33:33.0388516Z ) -> None: 2025-05-07T20:33:33.0388614Z torch.manual_seed(2025) 2025-05-07T20:33:33.0388691Z 2025-05-07T20:33:33.0388870Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0388959Z 2025-05-07T20:33:33.0389067Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0389218Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0389308Z x = x_sign * x_clamp 2025-05-07T20:33:33.0389386Z x0 = x[:, :D] 2025-05-07T20:33:33.0389467Z x1 = x[:, D:] 2025-05-07T20:33:33.0389542Z 2025-05-07T20:33:33.0389624Z if contiguous: 2025-05-07T20:33:33.0389715Z x0 = x0.contiguous() 2025-05-07T20:33:33.0389879Z x1 = x1.contiguous() 2025-05-07T20:33:33.0389953Z 2025-05-07T20:33:33.0390044Z if scale_ub is not None: 2025-05-07T20:33:33.0390156Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0390293Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0390374Z ) 2025-05-07T20:33:33.0390453Z else: 2025-05-07T20:33:33.0390548Z scale_ub_tensor = None 2025-05-07T20:33:33.0390623Z 2025-05-07T20:33:33.0390751Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0390842Z op = silu_mul_quant 2025-05-07T20:33:33.0390938Z if compiled: 2025-05-07T20:33:33.0391039Z op = torch.compile(op) 2025-05-07T20:33:33.0391145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0391292Z 2025-05-07T20:33:33.0391385Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0391390Z 2025-05-07T20:33:33.0391493Z moe/activation_test.py:117: 2025-05-07T20:33:33.0391628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0391727Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0391833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0392384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0392567Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0392958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0393196Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0393564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0393659Z kernel = self.compile( 2025-05-07T20:33:33.0394072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0394256Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0394447Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0394452Z 2025-05-07T20:33:33.0394665Z self = 2025-05-07T20:33:33.0395530Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0396124Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fde39670>} 2025-05-07T20:33:33.0396953Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0397154Z context = 2025-05-07T20:33:33.0397161Z 2025-05-07T20:33:33.0397341Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0397621Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0397731Z module_map=module_map) 2025-05-07T20:33:33.0397900Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0397998Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0398078Z E ^ 2025-05-07T20:33:33.0398468Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0398475Z 2025-05-07T20:33:33.0398924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0398928Z 2025-05-07T20:33:33.0399038Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0399271Z self=, 2025-05-07T20:33:33.0399348Z T=1, 2025-05-07T20:33:33.0399430Z D=7168, 2025-05-07T20:33:33.0399515Z scale_ub=None, 2025-05-07T20:33:33.0399601Z contiguous=True, 2025-05-07T20:33:33.0399691Z compiled=False, 2025-05-07T20:33:33.0399768Z ) 2025-05-07T20:33:33.0399994Z self = 2025-05-07T20:33:33.0400163Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:33.0400170Z 2025-05-07T20:33:33.0400257Z @given( 2025-05-07T20:33:33.0400376Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0400519Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0400638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0400756Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0400874Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0400953Z ) 2025-05-07T20:33:33.0401216Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0401313Z def test_silu_mul_quant( 2025-05-07T20:33:33.0401390Z self, 2025-05-07T20:33:33.0401507Z T: int, 2025-05-07T20:33:33.0401591Z D: int, 2025-05-07T20:33:33.0401689Z scale_ub: Optional[float], 2025-05-07T20:33:33.0401778Z contiguous: bool, 2025-05-07T20:33:33.0401872Z compiled: bool, 2025-05-07T20:33:33.0401954Z ) -> None: 2025-05-07T20:33:33.0402049Z torch.manual_seed(2025) 2025-05-07T20:33:33.0402126Z 2025-05-07T20:33:33.0402300Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0402378Z 2025-05-07T20:33:33.0402476Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0402604Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0402694Z x = x_sign * x_clamp 2025-05-07T20:33:33.0402776Z x0 = x[:, :D] 2025-05-07T20:33:33.0402857Z x1 = x[:, D:] 2025-05-07T20:33:33.0402973Z 2025-05-07T20:33:33.0403059Z if contiguous: 2025-05-07T20:33:33.0403151Z x0 = x0.contiguous() 2025-05-07T20:33:33.0403246Z x1 = x1.contiguous() 2025-05-07T20:33:33.0403325Z 2025-05-07T20:33:33.0403417Z if scale_ub is not None: 2025-05-07T20:33:33.0403530Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0403668Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0403785Z ) 2025-05-07T20:33:33.0403864Z else: 2025-05-07T20:33:33.0403959Z scale_ub_tensor = None 2025-05-07T20:33:33.0404037Z 2025-05-07T20:33:33.0404170Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0404258Z op = silu_mul_quant 2025-05-07T20:33:33.0404347Z if compiled: 2025-05-07T20:33:33.0404447Z op = torch.compile(op) 2025-05-07T20:33:33.0404554Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0404630Z 2025-05-07T20:33:33.0404724Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0404728Z 2025-05-07T20:33:33.0404825Z moe/activation_test.py:117: 2025-05-07T20:33:33.0404961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0405064Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0405165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0405711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0405810Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0406199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0406433Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0406797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0406894Z kernel = self.compile( 2025-05-07T20:33:33.0407306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0407489Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0407620Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0407624Z 2025-05-07T20:33:33.0407837Z self = 2025-05-07T20:33:33.0408740Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0409290Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd9d6280>} 2025-05-07T20:33:33.0410109Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0410343Z context = 2025-05-07T20:33:33.0410348Z 2025-05-07T20:33:33.0410519Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0410803Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0410909Z module_map=module_map) 2025-05-07T20:33:33.0411077Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0411177Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0411252Z E ^ 2025-05-07T20:33:33.0411641Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0411645Z 2025-05-07T20:33:33.0412132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0412137Z 2025-05-07T20:33:33.0412251Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0412485Z self=, 2025-05-07T20:33:33.0412564Z T=16384, 2025-05-07T20:33:33.0412649Z D=7168, 2025-05-07T20:33:33.0412777Z scale_ub=1200.0, 2025-05-07T20:33:33.0412869Z contiguous=False, 2025-05-07T20:33:33.0412963Z compiled=True, 2025-05-07T20:33:33.0413041Z ) 2025-05-07T20:33:33.0413273Z self = 2025-05-07T20:33:33.0413461Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:33.0413466Z 2025-05-07T20:33:33.0413543Z @given( 2025-05-07T20:33:33.0413666Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0413768Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0413884Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0414008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0414124Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0414198Z ) 2025-05-07T20:33:33.0414461Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0414557Z def test_silu_mul_quant( 2025-05-07T20:33:33.0414633Z self, 2025-05-07T20:33:33.0414711Z T: int, 2025-05-07T20:33:33.0414787Z D: int, 2025-05-07T20:33:33.0414889Z scale_ub: Optional[float], 2025-05-07T20:33:33.0414979Z contiguous: bool, 2025-05-07T20:33:33.0415066Z compiled: bool, 2025-05-07T20:33:33.0415146Z ) -> None: 2025-05-07T20:33:33.0415240Z torch.manual_seed(2025) 2025-05-07T20:33:33.0415313Z 2025-05-07T20:33:33.0415490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0415569Z 2025-05-07T20:33:33.0415660Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0415791Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0415882Z x = x_sign * x_clamp 2025-05-07T20:33:33.0415962Z x0 = x[:, :D] 2025-05-07T20:33:33.0416045Z x1 = x[:, D:] 2025-05-07T20:33:33.0416119Z 2025-05-07T20:33:33.0416202Z if contiguous: 2025-05-07T20:33:33.0416300Z x0 = x0.contiguous() 2025-05-07T20:33:33.0416388Z x1 = x1.contiguous() 2025-05-07T20:33:33.0416466Z 2025-05-07T20:33:33.0416559Z if scale_ub is not None: 2025-05-07T20:33:33.0416710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0416851Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0416925Z ) 2025-05-07T20:33:33.0417002Z else: 2025-05-07T20:33:33.0417101Z scale_ub_tensor = None 2025-05-07T20:33:33.0417177Z 2025-05-07T20:33:33.0417307Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0417401Z op = silu_mul_quant 2025-05-07T20:33:33.0417486Z if compiled: 2025-05-07T20:33:33.0417588Z op = torch.compile(op) 2025-05-07T20:33:33.0417740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0417812Z 2025-05-07T20:33:33.0417907Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0417911Z 2025-05-07T20:33:33.0418011Z moe/activation_test.py:117: 2025-05-07T20:33:33.0418145Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0418252Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0418354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0418752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0418861Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0419479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0419581Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0419964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0420200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0420566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0420701Z kernel = self.compile( 2025-05-07T20:33:33.0421114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0421297Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0421429Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0421434Z 2025-05-07T20:33:33.0421654Z self = 2025-05-07T20:33:33.0422504Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0423055Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd9d6ee0>} 2025-05-07T20:33:33.0423876Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0424073Z context = 2025-05-07T20:33:33.0424077Z 2025-05-07T20:33:33.0424248Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0424527Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0424636Z module_map=module_map) 2025-05-07T20:33:33.0424803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0424900Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0424979Z E ^ 2025-05-07T20:33:33.0425361Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0425368Z 2025-05-07T20:33:33.0425882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0425891Z 2025-05-07T20:33:33.0425995Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0426229Z self=, 2025-05-07T20:33:33.0426310Z T=1, 2025-05-07T20:33:33.0426387Z D=7168, 2025-05-07T20:33:33.0426470Z scale_ub=None, 2025-05-07T20:33:33.0426561Z contiguous=False, 2025-05-07T20:33:33.0426646Z compiled=False, 2025-05-07T20:33:33.0426720Z ) 2025-05-07T20:33:33.0426951Z self = 2025-05-07T20:33:33.0427164Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:33.0427169Z 2025-05-07T20:33:33.0427245Z @given( 2025-05-07T20:33:33.0427368Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0427469Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0427588Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0427711Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0427825Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0427902Z ) 2025-05-07T20:33:33.0428160Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0428252Z def test_silu_mul_quant( 2025-05-07T20:33:33.0428369Z self, 2025-05-07T20:33:33.0428449Z T: int, 2025-05-07T20:33:33.0428526Z D: int, 2025-05-07T20:33:33.0428625Z scale_ub: Optional[float], 2025-05-07T20:33:33.0428720Z contiguous: bool, 2025-05-07T20:33:33.0428815Z compiled: bool, 2025-05-07T20:33:33.0428913Z ) -> None: 2025-05-07T20:33:33.0429018Z torch.manual_seed(2025) 2025-05-07T20:33:33.0429108Z 2025-05-07T20:33:33.0429323Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0429398Z 2025-05-07T20:33:33.0429492Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0429620Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0429708Z x = x_sign * x_clamp 2025-05-07T20:33:33.0429878Z x0 = x[:, :D] 2025-05-07T20:33:33.0429957Z x1 = x[:, D:] 2025-05-07T20:33:33.0430030Z 2025-05-07T20:33:33.0430115Z if contiguous: 2025-05-07T20:33:33.0430206Z x0 = x0.contiguous() 2025-05-07T20:33:33.0430294Z x1 = x1.contiguous() 2025-05-07T20:33:33.0430369Z 2025-05-07T20:33:33.0430459Z if scale_ub is not None: 2025-05-07T20:33:33.0430564Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0430699Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0430775Z ) 2025-05-07T20:33:33.0430853Z else: 2025-05-07T20:33:33.0430945Z scale_ub_tensor = None 2025-05-07T20:33:33.0431021Z 2025-05-07T20:33:33.0431152Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0431241Z op = silu_mul_quant 2025-05-07T20:33:33.0431325Z if compiled: 2025-05-07T20:33:33.0431427Z op = torch.compile(op) 2025-05-07T20:33:33.0431532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0431603Z 2025-05-07T20:33:33.0431696Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0431700Z 2025-05-07T20:33:33.0431795Z moe/activation_test.py:117: 2025-05-07T20:33:33.0431930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0432027Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0432130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0432671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0432765Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0433151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0433432Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0433792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0433886Z kernel = self.compile( 2025-05-07T20:33:33.0434297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0434474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0434604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0434649Z 2025-05-07T20:33:33.0434857Z self = 2025-05-07T20:33:33.0435706Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0436254Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fdcfd670>} 2025-05-07T20:33:33.0437105Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0437302Z context = 2025-05-07T20:33:33.0437308Z 2025-05-07T20:33:33.0437474Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0437753Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0437901Z module_map=module_map) 2025-05-07T20:33:33.0438060Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0438159Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0438236Z E ^ 2025-05-07T20:33:33.0438618Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0438622Z 2025-05-07T20:33:33.0439066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0439073Z 2025-05-07T20:33:33.0439172Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0439401Z self=, 2025-05-07T20:33:33.0439482Z T=2048, 2025-05-07T20:33:33.0439556Z D=7168, 2025-05-07T20:33:33.0439635Z scale_ub=None, 2025-05-07T20:33:33.0439719Z contiguous=False, 2025-05-07T20:33:33.0439802Z compiled=True, 2025-05-07T20:33:33.0439875Z ) 2025-05-07T20:33:33.0440098Z self = 2025-05-07T20:33:33.0440277Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:33.0440284Z 2025-05-07T20:33:33.0440359Z @given( 2025-05-07T20:33:33.0440475Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0440577Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0440690Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0440808Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0440920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0440993Z ) 2025-05-07T20:33:33.0441254Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0441344Z def test_silu_mul_quant( 2025-05-07T20:33:33.0441418Z self, 2025-05-07T20:33:33.0441494Z T: int, 2025-05-07T20:33:33.0441566Z D: int, 2025-05-07T20:33:33.0441667Z scale_ub: Optional[float], 2025-05-07T20:33:33.0441760Z contiguous: bool, 2025-05-07T20:33:33.0441841Z compiled: bool, 2025-05-07T20:33:33.0441914Z ) -> None: 2025-05-07T20:33:33.0442060Z torch.manual_seed(2025) 2025-05-07T20:33:33.0442131Z 2025-05-07T20:33:33.0442300Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0442377Z 2025-05-07T20:33:33.0442465Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0442595Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0442683Z x = x_sign * x_clamp 2025-05-07T20:33:33.0442759Z x0 = x[:, :D] 2025-05-07T20:33:33.0442836Z x1 = x[:, D:] 2025-05-07T20:33:33.0442949Z 2025-05-07T20:33:33.0443028Z if contiguous: 2025-05-07T20:33:33.0443120Z x0 = x0.contiguous() 2025-05-07T20:33:33.0443208Z x1 = x1.contiguous() 2025-05-07T20:33:33.0443279Z 2025-05-07T20:33:33.0443373Z if scale_ub is not None: 2025-05-07T20:33:33.0443475Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0443609Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0443687Z ) 2025-05-07T20:33:33.0443758Z else: 2025-05-07T20:33:33.0443857Z scale_ub_tensor = None 2025-05-07T20:33:33.0443928Z 2025-05-07T20:33:33.0444060Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0444150Z op = silu_mul_quant 2025-05-07T20:33:33.0444272Z if compiled: 2025-05-07T20:33:33.0444373Z op = torch.compile(op) 2025-05-07T20:33:33.0444476Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0444545Z 2025-05-07T20:33:33.0444638Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0444643Z 2025-05-07T20:33:33.0444742Z moe/activation_test.py:117: 2025-05-07T20:33:33.0444873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0445015Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0445115Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0445510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0445603Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0446146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0446242Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0446625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0446862Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0447226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0447319Z kernel = self.compile( 2025-05-07T20:33:33.0447728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0447904Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0448038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0448042Z 2025-05-07T20:33:33.0448251Z self = 2025-05-07T20:33:33.0449131Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0449697Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fdc6f550>} 2025-05-07T20:33:33.0450508Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0450748Z context = 2025-05-07T20:33:33.0450753Z 2025-05-07T20:33:33.0450920Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0451199Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0451306Z module_map=module_map) 2025-05-07T20:33:33.0451468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0451565Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0451681Z E ^ 2025-05-07T20:33:33.0452063Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0452068Z 2025-05-07T20:33:33.0452510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0452518Z 2025-05-07T20:33:33.0452617Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0452851Z self=, 2025-05-07T20:33:33.0452926Z T=4096, 2025-05-07T20:33:33.0453001Z D=7168, 2025-05-07T20:33:33.0453085Z scale_ub=None, 2025-05-07T20:33:33.0453168Z contiguous=False, 2025-05-07T20:33:33.0453251Z compiled=True, 2025-05-07T20:33:33.0453322Z ) 2025-05-07T20:33:33.0453584Z self = 2025-05-07T20:33:33.0453765Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:33.0453773Z 2025-05-07T20:33:33.0453847Z @given( 2025-05-07T20:33:33.0453964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0454063Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0454176Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0454351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0454464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0454540Z ) 2025-05-07T20:33:33.0454798Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0454891Z def test_silu_mul_quant( 2025-05-07T20:33:33.0454964Z self, 2025-05-07T20:33:33.0455040Z T: int, 2025-05-07T20:33:33.0455112Z D: int, 2025-05-07T20:33:33.0455213Z scale_ub: Optional[float], 2025-05-07T20:33:33.0455301Z contiguous: bool, 2025-05-07T20:33:33.0455384Z compiled: bool, 2025-05-07T20:33:33.0455458Z ) -> None: 2025-05-07T20:33:33.0455562Z torch.manual_seed(2025) 2025-05-07T20:33:33.0455636Z 2025-05-07T20:33:33.0455803Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0455878Z 2025-05-07T20:33:33.0455972Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0456098Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0456185Z x = x_sign * x_clamp 2025-05-07T20:33:33.0456264Z x0 = x[:, :D] 2025-05-07T20:33:33.0456344Z x1 = x[:, D:] 2025-05-07T20:33:33.0456413Z 2025-05-07T20:33:33.0456493Z if contiguous: 2025-05-07T20:33:33.0456587Z x0 = x0.contiguous() 2025-05-07T20:33:33.0456675Z x1 = x1.contiguous() 2025-05-07T20:33:33.0456746Z 2025-05-07T20:33:33.0456837Z if scale_ub is not None: 2025-05-07T20:33:33.0456941Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0457075Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0457155Z ) 2025-05-07T20:33:33.0457231Z else: 2025-05-07T20:33:33.0457327Z scale_ub_tensor = None 2025-05-07T20:33:33.0457396Z 2025-05-07T20:33:33.0457522Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0457614Z op = silu_mul_quant 2025-05-07T20:33:33.0457696Z if compiled: 2025-05-07T20:33:33.0457794Z op = torch.compile(op) 2025-05-07T20:33:33.0457944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0458018Z 2025-05-07T20:33:33.0458107Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0458111Z 2025-05-07T20:33:33.0458206Z moe/activation_test.py:117: 2025-05-07T20:33:33.0458335Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0458435Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0458535Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0458930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0459096Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0459662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0459759Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0460142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0460377Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0460744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0460835Z kernel = self.compile( 2025-05-07T20:33:33.0461281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0461463Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0461591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0461596Z 2025-05-07T20:33:33.0461803Z self = 2025-05-07T20:33:33.0462692Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0463232Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd965160>} 2025-05-07T20:33:33.0464046Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0464240Z context = 2025-05-07T20:33:33.0464247Z 2025-05-07T20:33:33.0464416Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0464691Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0464798Z module_map=module_map) 2025-05-07T20:33:33.0464962Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0465061Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0465135Z E ^ 2025-05-07T20:33:33.0465518Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0465522Z 2025-05-07T20:33:33.0465967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0465972Z 2025-05-07T20:33:33.0466076Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0466307Z self=, 2025-05-07T20:33:33.0466385Z T=16384, 2025-05-07T20:33:33.0466461Z D=5120, 2025-05-07T20:33:33.0466542Z scale_ub=1200.0, 2025-05-07T20:33:33.0466628Z contiguous=False, 2025-05-07T20:33:33.0466716Z compiled=False, 2025-05-07T20:33:33.0466787Z ) 2025-05-07T20:33:33.0467013Z self = 2025-05-07T20:33:33.0467248Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:33.0467253Z 2025-05-07T20:33:33.0467330Z @given( 2025-05-07T20:33:33.0467449Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0467544Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0467659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0467779Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0467889Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0468001Z ) 2025-05-07T20:33:33.0468258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0468349Z def test_silu_mul_quant( 2025-05-07T20:33:33.0468425Z self, 2025-05-07T20:33:33.0468502Z T: int, 2025-05-07T20:33:33.0468576Z D: int, 2025-05-07T20:33:33.0468677Z scale_ub: Optional[float], 2025-05-07T20:33:33.0468761Z contiguous: bool, 2025-05-07T20:33:33.0468847Z compiled: bool, 2025-05-07T20:33:33.0468926Z ) -> None: 2025-05-07T20:33:33.0469019Z torch.manual_seed(2025) 2025-05-07T20:33:33.0469090Z 2025-05-07T20:33:33.0469266Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0469339Z 2025-05-07T20:33:33.0469428Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0469596Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0469684Z x = x_sign * x_clamp 2025-05-07T20:33:33.0469839Z x0 = x[:, :D] 2025-05-07T20:33:33.0469921Z x1 = x[:, D:] 2025-05-07T20:33:33.0469991Z 2025-05-07T20:33:33.0470075Z if contiguous: 2025-05-07T20:33:33.0470165Z x0 = x0.contiguous() 2025-05-07T20:33:33.0470255Z x1 = x1.contiguous() 2025-05-07T20:33:33.0470376Z 2025-05-07T20:33:33.0470464Z if scale_ub is not None: 2025-05-07T20:33:33.0470567Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0470707Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0470780Z ) 2025-05-07T20:33:33.0470855Z else: 2025-05-07T20:33:33.0470951Z scale_ub_tensor = None 2025-05-07T20:33:33.0471022Z 2025-05-07T20:33:33.0471148Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0471244Z op = silu_mul_quant 2025-05-07T20:33:33.0471326Z if compiled: 2025-05-07T20:33:33.0471428Z op = torch.compile(op) 2025-05-07T20:33:33.0471530Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0471602Z 2025-05-07T20:33:33.0471692Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0471696Z 2025-05-07T20:33:33.0471791Z moe/activation_test.py:117: 2025-05-07T20:33:33.0471920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0472026Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0472124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0472669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:33.0472765Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0473148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0473385Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0473746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0473837Z kernel = self.compile( 2025-05-07T20:33:33.0474247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0474426Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0474556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0474560Z 2025-05-07T20:33:33.0474812Z self = 2025-05-07T20:33:33.0475656Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0476201Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd965940>} 2025-05-07T20:33:33.0477053Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0477256Z context = 2025-05-07T20:33:33.0477261Z 2025-05-07T20:33:33.0477431Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0477707Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0477810Z module_map=module_map) 2025-05-07T20:33:33.0477971Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0478112Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0478190Z E ^ 2025-05-07T20:33:33.0478569Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0478576Z 2025-05-07T20:33:33.0479022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0479027Z 2025-05-07T20:33:33.0479170Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0479404Z self=, 2025-05-07T20:33:33.0479483Z T=16384, 2025-05-07T20:33:33.0479560Z D=5120, 2025-05-07T20:33:33.0479646Z scale_ub=1200.0, 2025-05-07T20:33:33.0479731Z contiguous=True, 2025-05-07T20:33:33.0479810Z compiled=True, 2025-05-07T20:33:33.0479888Z ) 2025-05-07T20:33:33.0480110Z self = 2025-05-07T20:33:33.0480292Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:33.0480296Z 2025-05-07T20:33:33.0480376Z @given( 2025-05-07T20:33:33.0480492Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0480592Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0480707Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0480821Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0480939Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0481013Z ) 2025-05-07T20:33:33.0481273Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0481372Z def test_silu_mul_quant( 2025-05-07T20:33:33.0481444Z self, 2025-05-07T20:33:33.0481517Z T: int, 2025-05-07T20:33:33.0481595Z D: int, 2025-05-07T20:33:33.0481689Z scale_ub: Optional[float], 2025-05-07T20:33:33.0481775Z contiguous: bool, 2025-05-07T20:33:33.0481864Z compiled: bool, 2025-05-07T20:33:33.0481942Z ) -> None: 2025-05-07T20:33:33.0482039Z torch.manual_seed(2025) 2025-05-07T20:33:33.0482112Z 2025-05-07T20:33:33.0482281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0482360Z 2025-05-07T20:33:33.0482450Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0482572Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0482661Z x = x_sign * x_clamp 2025-05-07T20:33:33.0482929Z x0 = x[:, :D] 2025-05-07T20:33:33.0483050Z x1 = x[:, D:] 2025-05-07T20:33:33.0483156Z 2025-05-07T20:33:33.0483239Z if contiguous: 2025-05-07T20:33:33.0483414Z x0 = x0.contiguous() 2025-05-07T20:33:33.0483508Z x1 = x1.contiguous() 2025-05-07T20:33:33.0483580Z 2025-05-07T20:33:33.0483674Z if scale_ub is not None: 2025-05-07T20:33:33.0483782Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0483928Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0484004Z ) 2025-05-07T20:33:33.0484078Z else: 2025-05-07T20:33:33.0484172Z scale_ub_tensor = None 2025-05-07T20:33:33.0484304Z 2025-05-07T20:33:33.0484430Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0484518Z op = silu_mul_quant 2025-05-07T20:33:33.0484602Z if compiled: 2025-05-07T20:33:33.0484698Z op = torch.compile(op) 2025-05-07T20:33:33.0484801Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0484871Z 2025-05-07T20:33:33.0484956Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0484961Z 2025-05-07T20:33:33.0485062Z moe/activation_test.py:117: 2025-05-07T20:33:33.0485190Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0485287Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0485383Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0485865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0485957Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0486496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0486596Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0486979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0487279Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0487641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0487732Z kernel = self.compile( 2025-05-07T20:33:33.0488140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0488320Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0488451Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0488455Z 2025-05-07T20:33:33.0488663Z self = 2025-05-07T20:33:33.0489561Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0490107Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fda0d550>} 2025-05-07T20:33:33.0490918Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0491111Z context = 2025-05-07T20:33:33.0491116Z 2025-05-07T20:33:33.0491280Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0491559Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0491661Z module_map=module_map) 2025-05-07T20:33:33.0491827Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0491921Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0491992Z E ^ 2025-05-07T20:33:33.0492416Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0492421Z 2025-05-07T20:33:33.0492864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0492869Z 2025-05-07T20:33:33.0492970Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0493199Z self=, 2025-05-07T20:33:33.0493275Z T=16384, 2025-05-07T20:33:33.0493347Z D=5120, 2025-05-07T20:33:33.0493464Z scale_ub=None, 2025-05-07T20:33:33.0493548Z contiguous=False, 2025-05-07T20:33:33.0493634Z compiled=True, 2025-05-07T20:33:33.0493704Z ) 2025-05-07T20:33:33.0493926Z self = 2025-05-07T20:33:33.0494110Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:33.0494115Z 2025-05-07T20:33:33.0494188Z @given( 2025-05-07T20:33:33.0494306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0494406Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0494517Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0494633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0494781Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0494853Z ) 2025-05-07T20:33:33.0495110Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0495200Z def test_silu_mul_quant( 2025-05-07T20:33:33.0495272Z self, 2025-05-07T20:33:33.0495348Z T: int, 2025-05-07T20:33:33.0495421Z D: int, 2025-05-07T20:33:33.0495516Z scale_ub: Optional[float], 2025-05-07T20:33:33.0495647Z contiguous: bool, 2025-05-07T20:33:33.0495728Z compiled: bool, 2025-05-07T20:33:33.0495802Z ) -> None: 2025-05-07T20:33:33.0495896Z torch.manual_seed(2025) 2025-05-07T20:33:33.0495964Z 2025-05-07T20:33:33.0496139Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0496210Z 2025-05-07T20:33:33.0496298Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0496420Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0496505Z x = x_sign * x_clamp 2025-05-07T20:33:33.0496582Z x0 = x[:, :D] 2025-05-07T20:33:33.0496661Z x1 = x[:, D:] 2025-05-07T20:33:33.0496731Z 2025-05-07T20:33:33.0496812Z if contiguous: 2025-05-07T20:33:33.0496907Z x0 = x0.contiguous() 2025-05-07T20:33:33.0496993Z x1 = x1.contiguous() 2025-05-07T20:33:33.0497063Z 2025-05-07T20:33:33.0497152Z if scale_ub is not None: 2025-05-07T20:33:33.0497253Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0497391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0497463Z ) 2025-05-07T20:33:33.0497532Z else: 2025-05-07T20:33:33.0497625Z scale_ub_tensor = None 2025-05-07T20:33:33.0497697Z 2025-05-07T20:33:33.0497824Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0497915Z op = silu_mul_quant 2025-05-07T20:33:33.0497994Z if compiled: 2025-05-07T20:33:33.0498092Z op = torch.compile(op) 2025-05-07T20:33:33.0498197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0498268Z 2025-05-07T20:33:33.0498354Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0498358Z 2025-05-07T20:33:33.0498459Z moe/activation_test.py:117: 2025-05-07T20:33:33.0498589Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0498691Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0498788Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0499179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0499318Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0499855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0499948Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0500332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0500565Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0500927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0501054Z kernel = self.compile( 2025-05-07T20:33:33.0501460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0501644Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0501775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0501780Z 2025-05-07T20:33:33.0501993Z self = 2025-05-07T20:33:33.0502881Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0503424Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd7b71f0>} 2025-05-07T20:33:33.0504236Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0504469Z context = 2025-05-07T20:33:33.0504473Z 2025-05-07T20:33:33.0504642Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0504914Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0505016Z module_map=module_map) 2025-05-07T20:33:33.0505179Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0505274Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0505347Z E ^ 2025-05-07T20:33:33.0505726Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0505733Z 2025-05-07T20:33:33.0506176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0506184Z 2025-05-07T20:33:33.0506285Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0506514Z self=, 2025-05-07T20:33:33.0506586Z T=2048, 2025-05-07T20:33:33.0506664Z D=5120, 2025-05-07T20:33:33.0506742Z scale_ub=None, 2025-05-07T20:33:33.0506830Z contiguous=False, 2025-05-07T20:33:33.0506907Z compiled=True, 2025-05-07T20:33:33.0506975Z ) 2025-05-07T20:33:33.0507198Z self = 2025-05-07T20:33:33.0507379Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:33.0507383Z 2025-05-07T20:33:33.0507455Z @given( 2025-05-07T20:33:33.0507580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0507676Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0507785Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0507900Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0508010Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0508085Z ) 2025-05-07T20:33:33.0508385Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0508478Z def test_silu_mul_quant( 2025-05-07T20:33:33.0508557Z self, 2025-05-07T20:33:33.0508628Z T: int, 2025-05-07T20:33:33.0508700Z D: int, 2025-05-07T20:33:33.0508800Z scale_ub: Optional[float], 2025-05-07T20:33:33.0508881Z contiguous: bool, 2025-05-07T20:33:33.0508968Z compiled: bool, 2025-05-07T20:33:33.0509048Z ) -> None: 2025-05-07T20:33:33.0509139Z torch.manual_seed(2025) 2025-05-07T20:33:33.0509209Z 2025-05-07T20:33:33.0513071Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0513162Z 2025-05-07T20:33:33.0513259Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0513393Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0513486Z x = x_sign * x_clamp 2025-05-07T20:33:33.0513570Z x0 = x[:, :D] 2025-05-07T20:33:33.0513650Z x1 = x[:, D:] 2025-05-07T20:33:33.0513724Z 2025-05-07T20:33:33.0513813Z if contiguous: 2025-05-07T20:33:33.0513905Z x0 = x0.contiguous() 2025-05-07T20:33:33.0513995Z x1 = x1.contiguous() 2025-05-07T20:33:33.0514075Z 2025-05-07T20:33:33.0514165Z if scale_ub is not None: 2025-05-07T20:33:33.0514273Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0514478Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0514556Z ) 2025-05-07T20:33:33.0514633Z else: 2025-05-07T20:33:33.0514736Z scale_ub_tensor = None 2025-05-07T20:33:33.0514810Z 2025-05-07T20:33:33.0514944Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0515040Z op = silu_mul_quant 2025-05-07T20:33:33.0515127Z if compiled: 2025-05-07T20:33:33.0515276Z op = torch.compile(op) 2025-05-07T20:33:33.0515381Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0515454Z 2025-05-07T20:33:33.0515550Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0515555Z 2025-05-07T20:33:33.0515652Z moe/activation_test.py:117: 2025-05-07T20:33:33.0515786Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0515896Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0516002Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0516409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0516510Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0517053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0517153Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0517539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0517775Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0518146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0518238Z kernel = self.compile( 2025-05-07T20:33:33.0518655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0518862Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0519017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0519025Z 2025-05-07T20:33:33.0519244Z self = 2025-05-07T20:33:33.0520097Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0520698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd7b7f70>} 2025-05-07T20:33:33.0521521Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0521719Z context = 2025-05-07T20:33:33.0521723Z 2025-05-07T20:33:33.0521962Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0522241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0522353Z module_map=module_map) 2025-05-07T20:33:33.0522521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0522618Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0522700Z E ^ 2025-05-07T20:33:33.0523091Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0523096Z 2025-05-07T20:33:33.0523548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0523552Z 2025-05-07T20:33:33.0523706Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0523939Z self=, 2025-05-07T20:33:33.0524025Z T=2048, 2025-05-07T20:33:33.0524103Z D=5120, 2025-05-07T20:33:33.0524188Z scale_ub=1200.0, 2025-05-07T20:33:33.0524281Z contiguous=False, 2025-05-07T20:33:33.0524365Z compiled=True, 2025-05-07T20:33:33.0524439Z ) 2025-05-07T20:33:33.0524714Z self = 2025-05-07T20:33:33.0524897Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:33.0524901Z 2025-05-07T20:33:33.0524984Z @given( 2025-05-07T20:33:33.0525103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0525200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0525319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0525439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0525552Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0525631Z ) 2025-05-07T20:33:33.0525894Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0525991Z def test_silu_mul_quant( 2025-05-07T20:33:33.0526069Z self, 2025-05-07T20:33:33.0526148Z T: int, 2025-05-07T20:33:33.0526224Z D: int, 2025-05-07T20:33:33.0526326Z scale_ub: Optional[float], 2025-05-07T20:33:33.0526421Z contiguous: bool, 2025-05-07T20:33:33.0526506Z compiled: bool, 2025-05-07T20:33:33.0526586Z ) -> None: 2025-05-07T20:33:33.0526688Z torch.manual_seed(2025) 2025-05-07T20:33:33.0526766Z 2025-05-07T20:33:33.0526941Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0527019Z 2025-05-07T20:33:33.0527112Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0527237Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0527331Z x = x_sign * x_clamp 2025-05-07T20:33:33.0527412Z x0 = x[:, :D] 2025-05-07T20:33:33.0527498Z x1 = x[:, D:] 2025-05-07T20:33:33.0527571Z 2025-05-07T20:33:33.0527656Z if contiguous: 2025-05-07T20:33:33.0527753Z x0 = x0.contiguous() 2025-05-07T20:33:33.0527842Z x1 = x1.contiguous() 2025-05-07T20:33:33.0527916Z 2025-05-07T20:33:33.0528015Z if scale_ub is not None: 2025-05-07T20:33:33.0528125Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0528261Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0528342Z ) 2025-05-07T20:33:33.0528465Z else: 2025-05-07T20:33:33.0528561Z scale_ub_tensor = None 2025-05-07T20:33:33.0528635Z 2025-05-07T20:33:33.0528766Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0528858Z op = silu_mul_quant 2025-05-07T20:33:33.0528943Z if compiled: 2025-05-07T20:33:33.0529067Z op = torch.compile(op) 2025-05-07T20:33:33.0529185Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0529275Z 2025-05-07T20:33:33.0529366Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0529411Z 2025-05-07T20:33:33.0529513Z moe/activation_test.py:117: 2025-05-07T20:33:33.0529647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0529752Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0529856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0530255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0530353Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0530893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0530989Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0531414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0531651Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0532019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0532114Z kernel = self.compile( 2025-05-07T20:33:33.0532526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0532748Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0532881Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0532886Z 2025-05-07T20:33:33.0533098Z self = 2025-05-07T20:33:33.0533951Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0534496Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd738940>} 2025-05-07T20:33:33.0535316Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0535515Z context = 2025-05-07T20:33:33.0535521Z 2025-05-07T20:33:33.0535694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0535971Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0536077Z module_map=module_map) 2025-05-07T20:33:33.0536246Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0536343Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0536420Z E ^ 2025-05-07T20:33:33.0536810Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0536815Z 2025-05-07T20:33:33.0537263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0537271Z 2025-05-07T20:33:33.0537379Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0537652Z self=, 2025-05-07T20:33:33.0537730Z T=4096, 2025-05-07T20:33:33.0537811Z D=5120, 2025-05-07T20:33:33.0537894Z scale_ub=1200.0, 2025-05-07T20:33:33.0537979Z contiguous=True, 2025-05-07T20:33:33.0538067Z compiled=True, 2025-05-07T20:33:33.0538144Z ) 2025-05-07T20:33:33.0538372Z self = 2025-05-07T20:33:33.0538554Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:33.0538559Z 2025-05-07T20:33:33.0538680Z @given( 2025-05-07T20:33:33.0538807Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0538907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0539047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0539191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0539316Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0539393Z ) 2025-05-07T20:33:33.0539663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0539758Z def test_silu_mul_quant( 2025-05-07T20:33:33.0539836Z self, 2025-05-07T20:33:33.0539915Z T: int, 2025-05-07T20:33:33.0539990Z D: int, 2025-05-07T20:33:33.0540134Z scale_ub: Optional[float], 2025-05-07T20:33:33.0540225Z contiguous: bool, 2025-05-07T20:33:33.0540310Z compiled: bool, 2025-05-07T20:33:33.0540392Z ) -> None: 2025-05-07T20:33:33.0540486Z torch.manual_seed(2025) 2025-05-07T20:33:33.0540564Z 2025-05-07T20:33:33.0540739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0540813Z 2025-05-07T20:33:33.0540906Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0541082Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0541172Z x = x_sign * x_clamp 2025-05-07T20:33:33.0541253Z x0 = x[:, :D] 2025-05-07T20:33:33.0541337Z x1 = x[:, D:] 2025-05-07T20:33:33.0541411Z 2025-05-07T20:33:33.0541497Z if contiguous: 2025-05-07T20:33:33.0541588Z x0 = x0.contiguous() 2025-05-07T20:33:33.0541678Z x1 = x1.contiguous() 2025-05-07T20:33:33.0541755Z 2025-05-07T20:33:33.0541849Z if scale_ub is not None: 2025-05-07T20:33:33.0541958Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0542098Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0542174Z ) 2025-05-07T20:33:33.0542253Z else: 2025-05-07T20:33:33.0542355Z scale_ub_tensor = None 2025-05-07T20:33:33.0542432Z 2025-05-07T20:33:33.0542563Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0542656Z op = silu_mul_quant 2025-05-07T20:33:33.0542745Z if compiled: 2025-05-07T20:33:33.0542847Z op = torch.compile(op) 2025-05-07T20:33:33.0542958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0543033Z 2025-05-07T20:33:33.0543128Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0543133Z 2025-05-07T20:33:33.0543231Z moe/activation_test.py:117: 2025-05-07T20:33:33.0543364Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0543471Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0543573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0543969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0544068Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0544607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0544713Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0545096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0545381Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0545750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0545844Z kernel = self.compile( 2025-05-07T20:33:33.0546259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0546443Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0546575Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0546619Z 2025-05-07T20:33:33.0546832Z self = 2025-05-07T20:33:33.0547681Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0548235Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd705790>} 2025-05-07T20:33:33.0549093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0549291Z context = 2025-05-07T20:33:33.0549297Z 2025-05-07T20:33:33.0549470Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0549747Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0549975Z module_map=module_map) 2025-05-07T20:33:33.0550137Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0550238Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0550321Z E ^ 2025-05-07T20:33:33.0550705Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0550709Z 2025-05-07T20:33:33.0551159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0551166Z 2025-05-07T20:33:33.0551268Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0551500Z self=, 2025-05-07T20:33:33.0551583Z T=128, 2025-05-07T20:33:33.0551660Z D=5120, 2025-05-07T20:33:33.0551742Z scale_ub=1200.0, 2025-05-07T20:33:33.0551830Z contiguous=False, 2025-05-07T20:33:33.0551917Z compiled=True, 2025-05-07T20:33:33.0551991Z ) 2025-05-07T20:33:33.0552219Z self = 2025-05-07T20:33:33.0552399Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:33.0552404Z 2025-05-07T20:33:33.0552484Z @given( 2025-05-07T20:33:33.0552606Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0552706Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0552825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0552947Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0553060Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0553137Z ) 2025-05-07T20:33:33.0553399Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0553493Z def test_silu_mul_quant( 2025-05-07T20:33:33.0553572Z self, 2025-05-07T20:33:33.0553647Z T: int, 2025-05-07T20:33:33.0553726Z D: int, 2025-05-07T20:33:33.0553826Z scale_ub: Optional[float], 2025-05-07T20:33:33.0553915Z contiguous: bool, 2025-05-07T20:33:33.0554002Z compiled: bool, 2025-05-07T20:33:33.0554154Z ) -> None: 2025-05-07T20:33:33.0554248Z torch.manual_seed(2025) 2025-05-07T20:33:33.0554324Z 2025-05-07T20:33:33.0554495Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0554569Z 2025-05-07T20:33:33.0554662Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0554791Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0554882Z x = x_sign * x_clamp 2025-05-07T20:33:33.0554969Z x0 = x[:, :D] 2025-05-07T20:33:33.0555048Z x1 = x[:, D:] 2025-05-07T20:33:33.0555162Z 2025-05-07T20:33:33.0555248Z if contiguous: 2025-05-07T20:33:33.0555340Z x0 = x0.contiguous() 2025-05-07T20:33:33.0555432Z x1 = x1.contiguous() 2025-05-07T20:33:33.0555510Z 2025-05-07T20:33:33.0555603Z if scale_ub is not None: 2025-05-07T20:33:33.0555712Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0555848Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0555925Z ) 2025-05-07T20:33:33.0556003Z else: 2025-05-07T20:33:33.0556099Z scale_ub_tensor = None 2025-05-07T20:33:33.0556172Z 2025-05-07T20:33:33.0556304Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0556394Z op = silu_mul_quant 2025-05-07T20:33:33.0556517Z if compiled: 2025-05-07T20:33:33.0556622Z op = torch.compile(op) 2025-05-07T20:33:33.0556727Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0556802Z 2025-05-07T20:33:33.0556896Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0556900Z 2025-05-07T20:33:33.0556998Z moe/activation_test.py:117: 2025-05-07T20:33:33.0557134Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0557277Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0557377Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0557782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0557876Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0558419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0558526Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0558921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0559163Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0559530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0559625Z kernel = self.compile( 2025-05-07T20:33:33.0560042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0560223Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0560360Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0560365Z 2025-05-07T20:33:33.0560575Z self = 2025-05-07T20:33:33.0561426Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0561981Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd5cc0d0>} 2025-05-07T20:33:33.0562795Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0563039Z context = 2025-05-07T20:33:33.0563044Z 2025-05-07T20:33:33.0563216Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0563492Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0563603Z module_map=module_map) 2025-05-07T20:33:33.0563765Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0563865Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0563988Z E ^ 2025-05-07T20:33:33.0564373Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0564378Z 2025-05-07T20:33:33.0564832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0564837Z 2025-05-07T20:33:33.0564939Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0565177Z self=, 2025-05-07T20:33:33.0565256Z T=16384, 2025-05-07T20:33:33.0565334Z D=7168, 2025-05-07T20:33:33.0565421Z scale_ub=1200.0, 2025-05-07T20:33:33.0565511Z contiguous=True, 2025-05-07T20:33:33.0565595Z compiled=True, 2025-05-07T20:33:33.0565712Z ) 2025-05-07T20:33:33.0565943Z self = 2025-05-07T20:33:33.0566123Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:33.0566130Z 2025-05-07T20:33:33.0566209Z @given( 2025-05-07T20:33:33.0566330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0566431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0566589Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0566705Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0566824Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0566899Z ) 2025-05-07T20:33:33.0567161Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0567260Z def test_silu_mul_quant( 2025-05-07T20:33:33.0567336Z self, 2025-05-07T20:33:33.0567413Z T: int, 2025-05-07T20:33:33.0567496Z D: int, 2025-05-07T20:33:33.0567598Z scale_ub: Optional[float], 2025-05-07T20:33:33.0567687Z contiguous: bool, 2025-05-07T20:33:33.0567775Z compiled: bool, 2025-05-07T20:33:33.0567856Z ) -> None: 2025-05-07T20:33:33.0567955Z torch.manual_seed(2025) 2025-05-07T20:33:33.0568028Z 2025-05-07T20:33:33.0568200Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0568280Z 2025-05-07T20:33:33.0568371Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0568496Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0568590Z x = x_sign * x_clamp 2025-05-07T20:33:33.0568673Z x0 = x[:, :D] 2025-05-07T20:33:33.0568753Z x1 = x[:, D:] 2025-05-07T20:33:33.0568832Z 2025-05-07T20:33:33.0568916Z if contiguous: 2025-05-07T20:33:33.0569030Z x0 = x0.contiguous() 2025-05-07T20:33:33.0569132Z x1 = x1.contiguous() 2025-05-07T20:33:33.0569226Z 2025-05-07T20:33:33.0569321Z if scale_ub is not None: 2025-05-07T20:33:33.0569431Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0569567Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0569650Z ) 2025-05-07T20:33:33.0569725Z else: 2025-05-07T20:33:33.0569817Z scale_ub_tensor = None 2025-05-07T20:33:33.0569894Z 2025-05-07T20:33:33.0570024Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0570115Z op = silu_mul_quant 2025-05-07T20:33:33.0570206Z if compiled: 2025-05-07T20:33:33.0570306Z op = torch.compile(op) 2025-05-07T20:33:33.0570460Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0570535Z 2025-05-07T20:33:33.0570624Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0570628Z 2025-05-07T20:33:33.0570729Z moe/activation_test.py:117: 2025-05-07T20:33:33.0570860Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0570962Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0571062Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0571452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0571583Z return fn(*args, **kwargs) 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
(then the identical silu_mul_quant -> Triton compile traceback and CompilationError as above)

The next examples failed identically, differing only in the sampled parameters:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError (fp8e4nv)
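For context on what keeps failing to compile: judging from the test body, silu_mul_quant fuses a SwiGLU-style gated activation, silu(x0) * x1, with dynamic rowwise quantization to FP8 E4M3 (Triton's fp8e4nv), returning the quantized tensor and per-row scales. A rough eager-mode sketch of those semantics is below; the function name and the exact scaling scheme (rowwise absmax, scale capped by scale_ub) are assumptions for illustration, and the real fbgemm_gpu kernel may differ:

    from typing import Optional, Tuple
    import torch

    FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SwiGLU-style gating, computed in fp32 for accuracy.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Dynamic rowwise scale, optionally capped by scale_ub
        # (mirroring the scale_ub_tensor the test passes in).
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_E4M3_MAX
        y_fp8 = (y / y_scale).clamp_(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(-1)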
As accumulated allocations filled the device, later examples began failing earlier, in the test's setup code, with CUDA OOM before ever reaching the kernel:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)):
     tried to allocate 320.00 MiB; GPU 0 has 22.07 GiB total, 140.44 MiB free, 21.92 GiB in use
     (21.60 GiB allocated by PyTorch, 45.02 MiB reserved but unallocated)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95: tried to allocate 112.00 MiB; 28.44 MiB free, 22.03 GiB in use
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...)): tried to allocate 448.00 MiB; 140.44 MiB free, 21.92 GiB in use
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95: tried to allocate 56.00 MiB; 28.44 MiB free, 22.03 GiB in use
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB; 28.44 MiB free, 22.03 GiB in use

(Each OOM message carries the allocator's standard advice: if reserved-but-unallocated memory is large, set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation; see https://pytorch.org/docs/stable/notes/cuda.html#environment-variables.)
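These OOMs look like a secondary effect of the run, not of any single example: the allocator reports roughly 21.9 to 22.0 GiB already in use on a 22.07 GiB device while the test asks for only 56 to 448 MiB, i.e. memory accumulated across the many preceding Hypothesis examples. Beyond the allocator's own expandable_segments suggestion, a common mitigation is to release cached blocks between examples; a minimal sketch, with the helper name and its placement in the test being hypothetical:

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references left over from the previous example,
        # then return the caching allocator's blocks to the driver so a
        # fresh large allocation can succeed.
        gc.collect()
        torch.cuda.empty_cache()

    # Hypothetical usage at the top of test_silu_mul_quant:
    #     release_cuda_memory()
    #     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)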
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:33.0677094Z 2025-05-07T20:33:33.0677210Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:33.0677218Z 2025-05-07T20:33:33.0677320Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0677545Z self=, 2025-05-07T20:33:33.0677617Z T=1, 2025-05-07T20:33:33.0677693Z D=7168, 2025-05-07T20:33:33.0677770Z scale_ub=1200.0, 2025-05-07T20:33:33.0677892Z contiguous=True, 2025-05-07T20:33:33.0677977Z compiled=False, 2025-05-07T20:33:33.0678045Z ) 2025-05-07T20:33:33.0678271Z self = 2025-05-07T20:33:33.0678441Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:33.0678445Z 2025-05-07T20:33:33.0678517Z @given( 2025-05-07T20:33:33.0678640Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0678785Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0678916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0679049Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0679177Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0679251Z ) 2025-05-07T20:33:33.0679506Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0679599Z def test_silu_mul_quant( 2025-05-07T20:33:33.0679675Z self, 2025-05-07T20:33:33.0679753Z T: int, 2025-05-07T20:33:33.0679827Z D: int, 2025-05-07T20:33:33.0679926Z scale_ub: Optional[float], 2025-05-07T20:33:33.0680014Z contiguous: bool, 2025-05-07T20:33:33.0680100Z compiled: bool, 2025-05-07T20:33:33.0680179Z ) -> None: 2025-05-07T20:33:33.0680270Z torch.manual_seed(2025) 2025-05-07T20:33:33.0680344Z 2025-05-07T20:33:33.0680515Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0680592Z 2025-05-07T20:33:33.0680683Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0680807Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0680894Z x = x_sign * x_clamp 2025-05-07T20:33:33.0680974Z x0 = x[:, :D] 2025-05-07T20:33:33.0681056Z x1 = x[:, D:] 2025-05-07T20:33:33.0681128Z 2025-05-07T20:33:33.0681210Z if contiguous: 2025-05-07T20:33:33.0681300Z x0 = x0.contiguous() 2025-05-07T20:33:33.0681393Z x1 = x1.contiguous() 2025-05-07T20:33:33.0681473Z 2025-05-07T20:33:33.0681564Z if scale_ub is not None: 2025-05-07T20:33:33.0681668Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0681808Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0681881Z ) 2025-05-07T20:33:33.0681956Z else: 2025-05-07T20:33:33.0682049Z scale_ub_tensor = None 2025-05-07T20:33:33.0682120Z 2025-05-07T20:33:33.0682251Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0682339Z op = silu_mul_quant 2025-05-07T20:33:33.0682424Z if compiled: 2025-05-07T20:33:33.0682574Z op = torch.compile(op) 2025-05-07T20:33:33.0682679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0683060Z 2025-05-07T20:33:33.0683200Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0683207Z 2025-05-07T20:33:33.0683310Z moe/activation_test.py:117: 2025-05-07T20:33:33.0683442Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0683546Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0683642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0684281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0684374Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0684759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0684999Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0685358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0685448Z kernel = self.compile( 2025-05-07T20:33:33.0685922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0686099Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0686231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0686239Z 2025-05-07T20:33:33.0686446Z self = 2025-05-07T20:33:33.0687291Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0687905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd197040>} 2025-05-07T20:33:33.0688724Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0688922Z context = 2025-05-07T20:33:33.0688927Z 2025-05-07T20:33:33.0689098Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0689375Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0689478Z module_map=module_map) 2025-05-07T20:33:33.0689641Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0689745Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0689816Z E ^ 2025-05-07T20:33:33.0690197Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0690201Z 2025-05-07T20:33:33.0690646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0690651Z 2025-05-07T20:33:33.0690753Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0690981Z self=, 2025-05-07T20:33:33.0691057Z T=128, 2025-05-07T20:33:33.0691126Z D=5120, 2025-05-07T20:33:33.0691206Z scale_ub=None, 2025-05-07T20:33:33.0691285Z contiguous=True, 2025-05-07T20:33:33.0691364Z compiled=False, 2025-05-07T20:33:33.0691437Z ) 2025-05-07T20:33:33.0691662Z self = 2025-05-07T20:33:33.0691834Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:33.0691844Z 2025-05-07T20:33:33.0691978Z @given( 2025-05-07T20:33:33.0692095Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0692193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0692305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0692418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0692533Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0692602Z ) 2025-05-07T20:33:33.0692857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0692992Z def test_silu_mul_quant( 2025-05-07T20:33:33.0693065Z self, 2025-05-07T20:33:33.0693135Z T: int, 2025-05-07T20:33:33.0693210Z D: int, 2025-05-07T20:33:33.0693311Z scale_ub: Optional[float], 2025-05-07T20:33:33.0693396Z contiguous: bool, 2025-05-07T20:33:33.0693476Z compiled: bool, 2025-05-07T20:33:33.0693551Z ) -> None: 2025-05-07T20:33:33.0693645Z torch.manual_seed(2025) 2025-05-07T20:33:33.0693715Z 2025-05-07T20:33:33.0693884Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0693957Z 2025-05-07T20:33:33.0694045Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0694166Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0694298Z x = x_sign * x_clamp 2025-05-07T20:33:33.0694376Z x0 = x[:, :D] 2025-05-07T20:33:33.0694452Z x1 = x[:, D:] 2025-05-07T20:33:33.0694521Z 2025-05-07T20:33:33.0694604Z if contiguous: 2025-05-07T20:33:33.0694695Z x0 = x0.contiguous() 2025-05-07T20:33:33.0694782Z x1 = x1.contiguous() 2025-05-07T20:33:33.0694850Z 2025-05-07T20:33:33.0694940Z if scale_ub is not None: 2025-05-07T20:33:33.0695084Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0695220Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0695297Z ) 2025-05-07T20:33:33.0695371Z else: 2025-05-07T20:33:33.0695460Z scale_ub_tensor = None 2025-05-07T20:33:33.0695535Z 2025-05-07T20:33:33.0695661Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0695746Z op = silu_mul_quant 2025-05-07T20:33:33.0695830Z if compiled: 2025-05-07T20:33:33.0695928Z op = torch.compile(op) 2025-05-07T20:33:33.0696036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0696105Z 2025-05-07T20:33:33.0696193Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0696203Z 2025-05-07T20:33:33.0696299Z moe/activation_test.py:117: 2025-05-07T20:33:33.0696426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0696524Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0696625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0697168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0697260Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0697644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0697883Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0698247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0698335Z kernel = self.compile( 2025-05-07T20:33:33.0698743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0698932Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0699077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0699082Z 2025-05-07T20:33:33.0699320Z self = 2025-05-07T20:33:33.0700212Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0700758Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd197a60>} 2025-05-07T20:33:33.0701575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0701830Z context = 2025-05-07T20:33:33.0701836Z 2025-05-07T20:33:33.0702006Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0702280Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0702383Z module_map=module_map) 2025-05-07T20:33:33.0702548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0702641Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0702712Z E ^ 2025-05-07T20:33:33.0703130Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0703135Z 2025-05-07T20:33:33.0703585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0703592Z 2025-05-07T20:33:33.0703699Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0703928Z self=, 2025-05-07T20:33:33.0704043Z T=128, 2025-05-07T20:33:33.0704121Z D=7168, 2025-05-07T20:33:33.0704198Z scale_ub=None, 2025-05-07T20:33:33.0704282Z contiguous=True, 2025-05-07T20:33:33.0704365Z compiled=False, 2025-05-07T20:33:33.0704436Z ) 2025-05-07T20:33:33.0704658Z self = 2025-05-07T20:33:33.0704831Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:33.0704836Z 2025-05-07T20:33:33.0704912Z @given( 2025-05-07T20:33:33.0705028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0705125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0705240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0705353Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0705461Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0705537Z ) 2025-05-07T20:33:33.0705794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0705882Z def test_silu_mul_quant( 2025-05-07T20:33:33.0705953Z self, 2025-05-07T20:33:33.0706033Z T: int, 2025-05-07T20:33:33.0706107Z D: int, 2025-05-07T20:33:33.0706200Z scale_ub: Optional[float], 2025-05-07T20:33:33.0706287Z contiguous: bool, 2025-05-07T20:33:33.0706368Z compiled: bool, 2025-05-07T20:33:33.0706444Z ) -> None: 2025-05-07T20:33:33.0706542Z torch.manual_seed(2025) 2025-05-07T20:33:33.0706612Z 2025-05-07T20:33:33.0706781Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0706850Z 2025-05-07T20:33:33.0706940Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0707061Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0707147Z x = x_sign * x_clamp 2025-05-07T20:33:33.0707224Z x0 = x[:, :D] 2025-05-07T20:33:33.0707305Z x1 = x[:, D:] 2025-05-07T20:33:33.0707373Z 2025-05-07T20:33:33.0707451Z if contiguous: 2025-05-07T20:33:33.0707542Z x0 = x0.contiguous() 2025-05-07T20:33:33.0707673Z x1 = x1.contiguous() 2025-05-07T20:33:33.0707746Z 2025-05-07T20:33:33.0707839Z if scale_ub is not None: 2025-05-07T20:33:33.0707940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0708074Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0708149Z ) 2025-05-07T20:33:33.0708219Z else: 2025-05-07T20:33:33.0708312Z scale_ub_tensor = None 2025-05-07T20:33:33.0708381Z 2025-05-07T20:33:33.0708507Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0708637Z op = silu_mul_quant 2025-05-07T20:33:33.0708717Z if compiled: 2025-05-07T20:33:33.0708811Z op = torch.compile(op) 2025-05-07T20:33:33.0708919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0708994Z 2025-05-07T20:33:33.0709080Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0709089Z 2025-05-07T20:33:33.0709182Z moe/activation_test.py:117: 2025-05-07T20:33:33.0709313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0709415Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0709509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0710146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0710246Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0710628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0710863Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0711224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0711355Z kernel = self.compile( 2025-05-07T20:33:33.0711766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0711943Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0712072Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0712076Z 2025-05-07T20:33:33.0712286Z self = 2025-05-07T20:33:33.0713133Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0713678Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd153790>} 2025-05-07T20:33:33.0714490Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0714685Z context = 2025-05-07T20:33:33.0714690Z 2025-05-07T20:33:33.0714854Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0715128Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0715235Z module_map=module_map) 2025-05-07T20:33:33.0715396Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0715493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0715566Z E ^ 2025-05-07T20:33:33.0715944Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0715952Z 2025-05-07T20:33:33.0716398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
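This CompilationError is an architecture mismatch rather than a code bug: Triton's fp8e4nv (FP8 E4M3) type is only lowered on NVIDIA GPUs of compute capability 8.9 or newer, while the A10G on a linux.g5.4xlarge runner reports 8.6, which is why only fp8e4b15 and fp8e5 are offered. A minimal guard sketch, assuming the suite would rather skip than fail on such GPUs; the helper name supports_fp8e4nv is hypothetical, not part of activation_test.py:

    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical guard: Triton only lowers fp8e4nv on compute
        # capability >= (8, 9); the A10G in this job reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch inside the test module:
    #   @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")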
Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Fifteen further examples failed with the same torch.OutOfMemoryError; the test source Hypothesis reprints for each is identical to the listing above, so only the distinguishing parameters, requested allocation, and failing line are kept:

  T=2048   D=5120  scale_ub=None    contiguous=True   compiled=False  ->  40.00 MiB  moe/activation_test.py:94 (x_sign = torch.sign(x))
  T=16384  D=5120  scale_ub=None    contiguous=True   compiled=False  -> 320.00 MiB  moe/activation_test.py:92
  T=4096   D=5120  scale_ub=None    contiguous=True   compiled=False  ->  80.00 MiB  moe/activation_test.py:92
  T=2048   D=5120  scale_ub=None    contiguous=False  compiled=False  ->  40.00 MiB  moe/activation_test.py:92
  T=4096   D=7168  scale_ub=None    contiguous=True   compiled=True   -> 112.00 MiB  moe/activation_test.py:92
  T=2048   D=5120  scale_ub=1200.0  contiguous=False  compiled=False  ->  40.00 MiB  moe/activation_test.py:92
  T=4096   D=7168  scale_ub=1200.0  contiguous=True   compiled=False  -> 112.00 MiB  moe/activation_test.py:92
  T=16384  D=7168  scale_ub=None    contiguous=False  compiled=True   -> 448.00 MiB  moe/activation_test.py:92
  T=4096   D=7168  scale_ub=None    contiguous=True   compiled=False  -> 112.00 MiB  moe/activation_test.py:92
  T=16384  D=7168  scale_ub=None    contiguous=True   compiled=False  -> 448.00 MiB  moe/activation_test.py:92
  T=16384  D=7168  scale_ub=1200.0  contiguous=True   compiled=False  -> 448.00 MiB  moe/activation_test.py:92
  T=2048   D=7168  scale_ub=None    contiguous=False  compiled=False  ->  56.00 MiB  moe/activation_test.py:92
  T=128    D=7168  scale_ub=1200.0  contiguous=True   compiled=False  ->  20.00 MiB  moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0))
  T=128    D=5120  scale_ub=1200.0  contiguous=True   compiled=True   ->  20.00 MiB  moe/activation_test.py:95
  T=128    D=7168  scale_ub=None    contiguous=True   compiled=True   ->  20.00 MiB  moe/activation_test.py:92

(Free device memory declined across the run: the early failures report 26.44 MiB free of the 22.07 GiB total, while the final examples report only 4.44 MiB free, so even 20.00 MiB requests failed.)
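The memory errors compound across Hypothesis examples: each failed example leaves its bfloat16 inputs alive, and the error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation. A cleanup sketch under those assumptions; _free_cuda_memory is a hypothetical helper, not part of activation_test.py, and because Hypothesis re-runs the test body once per example, calling it on entry would bound the accumulation:

    import gc
    import os
    import torch

    # Allocator hint from the error message; it only takes effect if set
    # before the process's first CUDA allocation.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    def _free_cuda_memory() -> None:
        # Drop dead Python references, then return cached blocks to the
        # driver so the next example's torch.randn([T, 2 * D]) can fit.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    # Usage sketch: call _free_cuda_memory() as the first statement of
    # test_silu_mul_quant's body.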
Three further examples reached the Triton kernel and failed exactly as the first example above (the repeated traceback through fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, triton/runtime/jit.py, and triton/compiler/compiler.py is omitted):

  T=1    D=5120  scale_ub=1200.0  contiguous=True   compiled=False
  T=128  D=5120  scale_ub=1200.0  contiguous=False  compiled=False
  T=128  D=7168  scale_ub=1200.0  contiguous=True   compiled=True   (entered via torch/_dynamo/eval_frame.py:678 under torch.compile)

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:33:33.0847552Z =============================== warnings summary =============================== 2025-05-07T20:33:33.0847878Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:33.0848192Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:33.0848542Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:33.0849504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:33.0849748Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:33.0849752Z 2025-05-07T20:33:33.0849974Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:33.0850145Z ================= 1 failed, 1 deselected, 3 warnings in 19.38s ================= 2025-05-07T20:33:34.6443316Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:34.7073402Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:34.7073735Z 2025-05-07T20:33:34.7073971Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:34.7074920Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:34.7075354Z 2025-05-07T20:33:34.7075358Z 2025-05-07T20:33:34.7075368Z 2025-05-07T20:33:34.7092144Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:34.7173739Z Post job cleanup. 2025-05-07T20:33:34.8164868Z [command]/usr/bin/git version 2025-05-07T20:33:34.8207865Z git version 2.47.1 2025-05-07T20:33:34.8247542Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/1cb924ec-5521-4663-8170-b1ccd2c7d762/.gitconfig' 2025-05-07T20:33:34.8260072Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/1cb924ec-5521-4663-8170-b1ccd2c7d762' before making global git config changes 2025-05-07T20:33:34.8261117Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:34.8265795Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:34.8311698Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:34.8345945Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:34.8679958Z Entering 'external/asmjit' 2025-05-07T20:33:34.8751287Z Entering 'external/composable_kernel' 2025-05-07T20:33:34.8825041Z Entering 'external/cpuinfo' 2025-05-07T20:33:34.8891703Z Entering 'external/cutlass' 2025-05-07T20:33:34.8966828Z Entering 'external/googletest' 2025-05-07T20:33:34.9033797Z Entering 'external/hipify_torch' 2025-05-07T20:33:34.9099985Z Entering 'external/json' 2025-05-07T20:33:34.9186304Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:34.9208870Z http.https://github.com/.extraheader 2025-05-07T20:33:34.9220062Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:34.9251964Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:34.9579114Z Entering 'external/asmjit' 2025-05-07T20:33:34.9622722Z http.https://github.com/.extraheader 2025-05-07T20:33:34.9666308Z Entering 'external/composable_kernel' 2025-05-07T20:33:34.9710365Z http.https://github.com/.extraheader 2025-05-07T20:33:34.9759864Z Entering 'external/cpuinfo' 2025-05-07T20:33:34.9802260Z http.https://github.com/.extraheader 2025-05-07T20:33:34.9844867Z Entering 'external/cutlass' 2025-05-07T20:33:34.9887459Z http.https://github.com/.extraheader 2025-05-07T20:33:34.9940096Z 
Entering 'external/googletest' 2025-05-07T20:33:34.9981839Z http.https://github.com/.extraheader 2025-05-07T20:33:35.0025346Z Entering 'external/hipify_torch' 2025-05-07T20:33:35.0067474Z http.https://github.com/.extraheader 2025-05-07T20:33:35.0110953Z Entering 'external/json' 2025-05-07T20:33:35.0152969Z http.https://github.com/.extraheader 2025-05-07T20:33:35.0301882Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:35.0333337Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:35.0343811Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:35.0344181Z ##[endgroup] 2025-05-07T20:33:35.0444695Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:45.8670401Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:34:02.2153500Z Cleaning up orphan processes
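For local triage, the failing invocation recorded above can be replayed through pytest's Python entry point. This is a sketch: the -k filter and the working directory (the gen_ai test tree containing moe/activation_test.py) are assumptions, not part of the CI command:

    import pytest

    # Re-run only test_silu_mul_quant with the flags the CI job used.
    pytest.main([
        "-v", "-rsx", "-s",
        "-W", "ignore::pytest.PytestCollectionWarning",
        "-k", "test_silu_mul_quant",
        "./moe/activation_test.py",
    ])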